Abstract

We give an algorithm for learning a mixture of unstructured distributions. This problem 
arises in various unsupervised learning scenarios, for example in learning topic models from a 
corpus of documents spanning several topics. We show how to learn the constituents (the topic 
distributions and the mixture weights) of a mixture of k (constant) arbitrary distributions over 
a large discrete domain [n] = {1, 2, . . . , n}, using O(n polylog n) samples.

This task is information-theoretically impossible for k > 1 under the usual sampling process
from a mixture distribution. However, there are situations (such as the above-mentioned topic 
model case) in which each sample point consists of several observations from the same mixture 
constituent. This number of observations, which we call the "sampling aperture", is a crucial 
parameter of the problem. We show that efficient learning is possible exactly at the
information-theoretically least-possible aperture of 2k − 1. (Independent work by others places certain
restrictions on the model, which enables learning with smaller aperture, albeit using, in general, 
a significantly larger sample size.) 

A sequence of tools contribute to the algorithm, such as concentration results for random 
matrices, dimension reduction, moment estimations, and sensitivity analysis. 

1 Introduction 

We give an algorithm for learning a mixture of unstructured distributions. More specifically, we 
consider the problem of learning a mixture of k arbitrary distributions over a large finite domain 
[n] = {1,2, ... ,n}. This finds applications in various unsupervised learning scenarios including 
collaborative filtering [26], and learning topic models from a corpus of documents spanning several 
topics [36], which we will use as our prototypical motivating example. Our goal is to learn the
probabilistic model that is hypothesized to generate the observed data. In particular, we learn the 
constituents of the mixture, i.e., the k distributions defining the topics, and their weights in the 
mixture. 

It is information-theoretically impossible to reconstruct the mixture model from single-snapshot 
samples (e.g., single-word documents). Thus, our work relies on multi-snapshot samples. To
illustrate, in the (pure documents) topic model introduced in [36], each document consists of
a bag of words generated by selecting a topic with probability proportional to its mixture weight 
and then taking independent samples from this topic's distribution (over words); so n is the size 
of the vocabulary and k is the number of topics. Notice that typically n will be quite large, and
significantly larger than k. Also, clearly, if very long documents are available, the problem becomes
trivial, as each document already provides a very good sample for the distribution of its topic.
Thus, it is desirable to keep the dependence of the sample size on n as low as possible, while at the
same time minimizing what we call the aperture, which is the number of snapshots per sample point
(i.e., words per document). These parameters govern both the applicability of an algorithm and 
its computational complexity. 
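
The need for multi-snapshot samples can be checked on a toy instance. The sketch below is only an illustration with made-up numbers (the two mixtures and the helper name `snapshot_distribution` are hypothetical, not taken from the paper): two different 2-mixtures over a 3-word vocabulary induce the same distribution on single-word documents, yet different distributions on 2-word documents.

```python
from itertools import product

def snapshot_distribution(weights, constituents, m):
    """Exact distribution of m-snapshots: sum_t w_t * prod_j p^t[i_j]."""
    n = len(constituents[0])
    dist = {}
    for items in product(range(n), repeat=m):
        prob = 0.0
        for w, p in zip(weights, constituents):
            term = w
            for i in items:
                term *= p[i]  # items are independent given the constituent
            prob += term
        dist[items] = prob
    return dist

# Two different mixtures (hypothetical numbers) with the same mean vector.
mix_a = ([0.5, 0.5], [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]])
mix_b = ([0.5, 0.5], [[0.5, 0.3, 0.2], [0.3, 0.3, 0.4]])

one_a = snapshot_distribution(*mix_a, m=1)
one_b = snapshot_distribution(*mix_b, m=1)
two_a = snapshot_distribution(*mix_a, m=2)
two_b = snapshot_distribution(*mix_b, m=2)

same_on_singles = all(abs(one_a[key] - one_b[key]) < 1e-12 for key in one_a)
differ_on_pairs = any(abs(two_a[key] - two_b[key]) > 1e-6 for key in two_a)
```

Here `same_on_singles` and `differ_on_pairs` both evaluate to true, so aperture 1 cannot separate the two models while aperture 2 can.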

Our results. Let $p^1, \ldots, p^k \in \Delta^{n-1}$ denote the k mixture constituents, where $\Delta^{n-1}$ is the
(n − 1)-simplex, and $w_1, \ldots, w_k$ denote the mixture weights. Our algorithm uses

documents (i.e., samples) and reconstructs with high probability (see Theorem 4.1) each mixture
constituent up to $\ell_1$-error $\epsilon$, and each mixture weight up to additive error $\epsilon$. We make no
assumptions on the constituents. The asymptotic notation hides factors that are polynomial in
$w_{\min} := \min_t w_t$ and the "width" of the mixture (which intuitively measures the minimum variation
distance between any two constituents). The three terms in (1) correspond to the requirements for
the number of 1-, 2-, and (2k − 1)-snapshots respectively. So we need aperture 2k − 1 only for a
small part of the sample. (Clearly, longer documents can be split into pieces that can be used as
independent samples.)

To put our bounds in perspective, notice importantly that we recover the mixture constituents
within $\ell_1$-distance $\epsilon$. With fixed aperture (independent of n), a sample size of $\Omega(n)$ is necessary
to recover even the expectation of the mixture distribution with constant $\ell_1$-error. On the other
hand, aperture 0((n + A;^) logn/c) is sufficient for algorithmically trivial recovery of the model with 
constant $\ell_\infty$ error using few samples. Restricting the aperture to 2k − 2 makes recovery impossible
to arbitrary accuracy (without additional assumptions): there could be two far-apart k-mixtures
that generate exactly the same sample distribution. Thus, we obtain near-optimal dependence on 
n and optimal aperture. 

Our work provides new insights into the widely-studied problem of learning topic models and
nicely complements the recent interesting work of [5, 4, 3]. This body of work recovers the
constituents (under certain assumptions) up to $\ell_2$ or $\ell_\infty$ error, using a sample size that is poly(k) and
independent of (or sublinear in) n. However, if we seek to achieve $\ell_1$-error $\epsilon$, there are inputs for
which their sample size (although poly(n. A:)) is r2(n^) (or worse, again ignoring dependence on 
$w_{\min}$ and "width"; see Appendix A). This is a significantly poorer dependence on n compared to
our near-linear dependence (so our bounds are better when n is quite large but k is small, e.g., a 
constant). Observe that with Q{n'^) samples, the entire distribution on 3- word documents can be 
estimated fairly accurately; the challenge in [5, 4, 3] is therefore to recover the model from this
relatively noiseless data. In contrast, a major challenge that we face to achieve $\ell_1$-reconstruction
with O(n polylog n) samples is to ensure that the error remains bounded despite the presence of
very noisy data due to the small sample size, and we need to develop suitable machinery to achieve
this. (An interesting research direction would be to combine the various approaches to obtain an
O(n · poly(k, ln n)) sample size.)

We now give a rough sketch of our algorithm (Algorithm 1 in Section 3) and the ideas behind
its analysis (Section 4). Let $P = (p^1, \ldots, p^k)$, let $r = \sum_t w_t p^t$ be the expectation of the mixture,
and let $k' = \operatorname{rank}(p^1 - r, \ldots, p^k - r)$. Our algorithm reduces the problem to the problem of learning
one-dimensional mixtures. We choose k' random lines that are close to the affine hull, aff(P), of
P and "project" the mixture on to these k' lines. We learn each projected mixture, which is a
one-dimensional mixture-learning problem, and then combine the inferred projections on these k'
lines to obtain k points that are close to aff(P). Finally, we project these k points on to $\Delta^{n-1}$ to
obtain k distributions over [n], which we argue are close (in $\ell_1$-distance) to $p^1, \ldots, p^k$.

Various difficulties arise in implementing this plan. We first learn a good approximation to
aff(P) using spectral techniques and only 2-snapshots. We use ideas similar to [32, 7, 31], but our
challenge is to show that the covariance matrix $A = \sum_t w_t (p^t - r)(p^t - r)^\top$ can be well-approximated
by the empirical covariance matrix with only 0(n In^ n) 2-snapshots. A random orthonormal basis 
of the learned affine space then supplies the k' lines on which we project our mixture. Of course,
we do not know P, so "projecting" on to a basis vector b actually means that we project snapshots
from P on to b by mapping item i to $b_i$. For this to be meaningful, we need to ensure that if the
mixture constituents are far apart in variation distance then their projections $(b^\top p^t)_{t \in [k]}$ are also
well separated relative to the spread of the support $\{b_1, \ldots, b_n\}$ of the one-dimensional distribution.
In general, we can only claim a relative separation of (since minj^j/ — ||2 may be 

G(^)). We avoid this via a careful balancing act: we prove (Lemma 14. 3p that the £00 norm of 
unit vectors in aff(P) is 0(-^), and argue that this isotropy property suffices since b is close to 
aff(P). 

Finally, a key ingredient of our algorithm (see Section 5) is to show how to learn the real
projections $(b^\top p^t)_{t \in [k]}$ from the projected snapshots. This is technically the most difficult step
and the one that requires aperture 2k − 1 (the smallest aperture at which this is
information-theoretically possible). We show that the projected snapshots on b yield empirical moments of a
related distribution and use this to learn the projections and the mixture weights via a method
of moments (see, e.g., [231 1221 [211 UHl [Ml [I])- One technical difficulty is that variation distance 
in $\Delta^{n-1}$ translates to transportation distance [39] in the one-dimensional projection. We use a
combination of convex programming and numerical-analysis techniques to learn the projections
from the empirical "directional" moments. In the process, we establish some novel properties about
the moment curve (an object that plays a central role in convex and polyhedral geometry [8]) that
may be of independent interest.
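
To make the role of the moments concrete: for an ℓ-snapshot $(i_1, \ldots, i_\ell)$ and a vector b, the items are independent given the chosen constituent, so $E[b_{i_1} \cdots b_{i_\ell}] = \sum_t w_t (b^\top p^t)^\ell$. A k-spike distribution has 2k − 1 free parameters, which matches the 2k − 1 moments available at aperture 2k − 1. The sketch below illustrates this for k = 2 with exact moments. It uses a classical Prony-style recurrence, which is a stand-in technique and not the convex-programming procedure of Section 5; it also ignores sampling noise entirely. All function names and numbers are hypothetical.

```python
import math

def directional_moments(weights, constituents, b, max_order):
    """Exact moments m_l = sum_t w_t (b . p^t)^l for l = 0..max_order."""
    spikes = [sum(bi * pi for bi, pi in zip(b, p)) for p in constituents]
    return [sum(w * s ** l for w, s in zip(weights, spikes))
            for l in range(max_order + 1)]

def two_spikes_from_moments(m):
    """Recover (weights, spikes) of a 2-spike distribution from m[0..3].

    The moments satisfy m_{l+2} = c1 * m_{l+1} + c0 * m_l, where
    c1 is the sum of the two spikes and c0 is minus their product.
    """
    m0, m1, m2, m3 = m
    det = m1 * m1 - m0 * m2          # minus the variance; nonzero for distinct spikes
    c1 = (m1 * m2 - m0 * m3) / det   # Cramer's rule on the 2x2 system
    c0 = (m1 * m3 - m2 * m2) / det
    root = math.sqrt(c1 * c1 + 4.0 * c0)
    lo, hi = (c1 - root) / 2.0, (c1 + root) / 2.0
    w_lo = (hi - m1) / (hi - lo)     # from m1 = w_lo * lo + (1 - w_lo) * hi
    return [w_lo, 1.0 - w_lo], [lo, hi]

# Hypothetical instance: k = 2, so moments up to order 2k - 1 = 3 are used.
moments = directional_moments([0.3, 0.7],
                              [[0.6, 0.3, 0.1], [0.2, 0.3, 0.5]],
                              [1.0, 0.0, -1.0], 3)
weights_hat, spikes_hat = two_spikes_from_moments(moments)
```

In this instance the true projections are 0.5 (weight 0.3) and −0.3 (weight 0.7), and the recurrence recovers them from the first three moments.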

Related work. The past decade has witnessed tremendous progress in the theory of learning 
statistical mixture models. The most striking example is that of learning mixtures of high dimen- 
sional Gaussians. Starting with Dasgupta's groundbreaking paper [19], a long sequence of improve- 
ments [20l|6l[38l[29l[Il[22l[13] culminated in the recent results [28l[ini[33] that essentially resolve the 
problem in its general form. In this vein, other highly structured mixture models, such as mixtures 
of discrete product distributions [301 [H [13 (SS [HI [IS] and similar models [El [S [Ml |29l [H [El [2l] , 
have been studied intensively. One important difference between this line of work and ours is that 
the structure of those mixtures enables learning using single-snapshot samples, whereas this is im- 
possible in our case. Another interesting difference between our setting and the work on structured 
models (and this is typical of most results on PAC-style learning) is that the amount of informa- 
tion in each sample point is roughly in the same ballpark as the information needed to describe the 
model. In our setting, the amount of information in each sample point is exponentially sparser than 
the information needed to describe the model to good accuracy. Thus, the topic-modeling problem 
motivates the natural question of inference from sparse samples. This issue is also encountered in 
collaborative filtering; see [31] for some related theoretical problems. 

Recently, we learnt about an independent line of inquiry into very much the same question as
ours [5, 4, 3].¹

¹ An earlier stage of this work, including the case of k = 2 topics as well as some other results that are not subsumed
by this paper, dates to 2007. The last version of that phase has been posted since May 2008 at [37]. The extension
to arbitrary k is from this year.

All three papers make certain assumptions about the mixture constituents, which
makes it possible to learn the mixture constituents with constant aperture. In comparison with
our work, their poly(n, k) sample size (for $\ell_1$-error) is attractive in terms of k but has a worse
dependence on n {Q,{n^)). 

The assumptions in [5, 4, 3] impose some limitations on the applicability of their algorithms.
To understand this, it is illuminating to consider the case where all the $p^t$s lie on a line-segment in
$\Delta^{n-1}$. This poses no problems for our algorithm, and we recover the $p^t$s along with their mixture
weights. However, as we show below, the algorithms in [5, 4, 3] all fail to reconstruct this mixture.
Anandkumar et al. [4] solve the same problem that we consider, under the assumption that P
(viewed as an n × k matrix) has rank k. This assumption is clearly violated in this setting, rendering
their algorithm inapplicable. The other two papers [5, 3] deal with the more general setting where
each document is generated by a combination of topics [36, 25]: first a convex combination $\lambda \in \Delta^{k-1}$
is sampled from a mixture distribution T on $\Delta^{k-1}$, then the document is generated by sampling
words from the distribution $\sum_{t=1}^{k} \lambda_t p^t$. The goal is to learn the topic distributions and the mixture
distribution. (The problem we consider is the special case where T places weight $w_t$ on the t-th
vertex of $\Delta^{k-1}$.) [5] posits a p-separability assumption on the topics, wherein each topic $p^t$ has a
unique anchor word i such that $p^t_i \ge p$ and $p^{t'}_i = 0$ for every $t' \ne t$, whereas [3] weakens this to the
requirement that the $p^t$s be linearly independent. Both papers show how to learn the model when
T is the Dirichlet distribution (which gives the latent Dirichlet model [E]); the paper [5j obtains 
results for other mixture distributions as well. 

In order to apply these algorithms, we can view the input as being specified by two topics, x and
y, which are the end points of the line segment; T then places weight $w_t$ on the convex combination
$(\lambda_t, 1 - \lambda_t)$, where $p^t = \lambda_t x + (1 - \lambda_t) y$. This T is far from the Dirichlet distribution, so [1] does
not apply here. Suppose that x and y satisfy the p-separability condition. (Note that p may only
be O(^), even if x and y have disjoint supports.) We can then apply the algorithm of Arora et 
al. [5]. But this does not recover T; it returns the topic-topic correlation matrix $E_T[\lambda\lambda^\top]$, which
does not reconstruct the mixture (w, P).

This limitation should not be surprising since [5] uses constant aperture. Indeed, [5] notes
that it is impossible to reconstruct T with arbitrary accuracy (with any constant aperture) even 
if one knows the topics x and y. In this context, we remark that our earlier work [S^ uses the 
approach presented in this paper and solves the problem for documents that are arbitrary mixtures 
of two topics, yielding a crisp statement about the tradeoff between the sampling aperture and the 
accuracy with which T can be learnt. 

Finally, it is also pertinent to compare the topic modeling problem with the problem of learning 
a mixture of product distributions (e.g., [23]). Multi-snapshot samples can be thought of as
single-snapshot samples from the power distribution on $[n]^K$, where K is the aperture. The product
distribution literature typically deals with sample spaces that are the product of many small
cardinality components, whereas the topic modeling problem deals with sample spaces that are
the product of few large cardinality components.

2 Preliminaries 

2.1 Mixture sources, snapshots, and projections 

Let [n] denote {1, 2, . . . , n}, and $\Delta^{n-1}$ denote the (n − 1)-simplex $\{x \in \mathbb{R}^n_{\ge 0} : \sum_i x_i = 1\}$. A k-
mixture source (w, P) on [n] consists of k mixture constituents $P = (p^1, p^2, \ldots, p^k)$, where $p^t$ has
support [n] for all $t \in [k]$, along with the corresponding mixture weights $w = (w_1, \ldots, w_k) \in \Delta^{k-1}$.

to arbitrary k is from this year. 
An m-snapshot from (w, P) is obtained by choosing $t \in [k]$ according to the distribution w, and
then choosing $i \in [n]$ m times independently according to the distribution $p^t$. The probability
distribution on m-snapshots is thus a mixture of k power distributions on the product space $[n]^m$.
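
The sampling process just described can be sketched in a few lines. This is a minimal illustration, assuming items are indexed from 0 and that an explicit `random.Random` instance is passed in; the function name `sample_snapshot` is hypothetical.

```python
import random

def sample_snapshot(weights, constituents, m, rng):
    """Draw one m-snapshot: pick a constituent t ~ w, then m i.i.d. items ~ p^t."""
    n = len(constituents[0])
    t = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    return tuple(rng.choices(range(n), weights=constituents[t], k=m))

# With two constituents of disjoint support, every snapshot reveals its topic:
# all m items of a snapshot come from the same constituent.
rng = random.Random(0)
snapshots = [sample_snapshot([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]], 3, rng)
             for _ in range(50)]
```

Each entry of `snapshots` is a 3-tuple whose items are all equal, reflecting that the constituent is chosen once per snapshot rather than once per item.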

We also consider mixture sources whose constituents are distributions on $\mathbb{R}$. A k-mixture source
(w, P) on $\mathbb{R}$ consists of k mixture constituents $P = (p^1, \ldots, p^k)$, where each $p^t$ is a probability
distribution on $\mathbb{R}$, along with corresponding mixture weights $w = (w_1, \ldots, w_k) \in \Delta^{k-1}$.

Given a distribution p on [n] and a vector $x \in \mathbb{R}^n$, we define the projection of p on x, denoted
$\pi_x(p)$, to be the discrete distribution on $\mathbb{R}$ that assigns probability mass $\sum_{i : x_i = \beta} p_i$ to $\beta \in \mathbb{R}$. (Thus,
$\pi_x(p)$ has support $\{x_1, \ldots, x_n\}$ and $E[\pi_x(p)] = x^\top p$.) Given a k-mixture source (w, P) on [n], we
define the projected k-mixture source $(w, \pi_x(P))$ on $\mathbb{R}$ to be the k-mixture source on $\mathbb{R}$ given by
$(w, (\pi_x(p^1), \ldots, \pi_x(p^k)))$.

We also denote by $(w, E[\pi_x(P)])$ the distribution that assigns probability mass $w_t$ to $E[\pi_x(p^t)] = x^\top p^t$
for all $t \in [k]$. This is an example of what we call a k-spike distribution, which is a distribution
on $\mathbb{R}$ that assigns positive probability mass to k points in $\mathbb{R}$.
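
As an illustration of these definitions, the following sketch computes the projection $\pi_x(p)$ and the associated k-spike distribution for small inputs. The function names and the numbers are hypothetical; distributions are plain lists and the projected distribution is a dictionary from real values to masses.

```python
from collections import defaultdict

def project(p, x):
    """pi_x(p): place mass sum_{i : x_i = beta} p_i on each value beta."""
    masses = defaultdict(float)
    for p_i, x_i in zip(p, x):
        masses[x_i] += p_i
    return dict(masses)

def spike_distribution(weights, constituents, x):
    """(w, E[pi_x(P)]): list of (w_t, x . p^t) pairs, one per constituent."""
    return [(w, sum(x_i * p_i for x_i, p_i in zip(x, p)))
            for w, p in zip(weights, constituents)]

p = [0.2, 0.3, 0.5]
x = [1.0, 1.0, -2.0]
projected = project(p, x)                          # items 0 and 1 merge at value 1.0
mean = sum(beta * mass for beta, mass in projected.items())
spikes = spike_distribution([1.0], [p], x)
```

The mean of the projected distribution agrees with $x^\top p$, which is exactly the location of the corresponding spike.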

2.2 Transportation distance for mixtures 

Let $(w, (p^1, \ldots, p^k))$ be a k-mixture source on [n], and $(\tilde w, (\tilde p^1, \ldots, \tilde p^\ell))$ be an ℓ-mixture source on
[n]. The transportation distance (with respect to the total variation distance $\frac{1}{2}\|x - y\|_1$ on measures
on $\Delta^{n-1}$) between these two mixture sources, denoted by $\mathrm{Tran}(w, P; \tilde w, \tilde P)$, is the optimum value
of the following linear program (LP).

$$\min \sum_{i=1}^{k} \sum_{j=1}^{\ell} x_{ij} \cdot \tfrac{1}{2}\|p^i - \tilde p^j\|_1 \quad \text{s.t.} \quad \sum_{j=1}^{\ell} x_{ij} = w_i \ \ \forall i \in [k], \qquad \sum_{i=1}^{k} x_{ij} = \tilde w_j \ \ \forall j \in [\ell], \qquad x \ge 0.$$

The transportation distance $\mathrm{Tran}(w, \alpha; \tilde w, \tilde\alpha)$ between a k-spike distribution $(w, \alpha = (\alpha_1, \ldots, \alpha_k))$
and an ℓ-spike distribution $(\tilde w, \tilde\alpha = (\tilde\alpha_1, \ldots, \tilde\alpha_\ell))$ is defined as the optimum value of the above LP
with the objective function replaced by $\sum_{i \in [k], j \in [\ell]} x_{ij} |\alpha_i - \tilde\alpha_j|$. Observe that if we view $(w, \alpha)$
equivalently as a k-mixture source $(w, (f^1, \ldots, f^k))$ on {0, 1} with $f^t_1 = \alpha_t$, and $(\tilde w, \tilde\alpha)$ similarly as
an ℓ-mixture source on {0, 1}, then $\mathrm{Tran}(w, \alpha; \tilde w, \tilde\alpha)$ is simply the transportation distance between
these k- and ℓ-mixture sources on {0, 1}.
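
For spike distributions on the real line, the LP above does not need a solver: with cost $|\alpha_i - \tilde\alpha_j|$, the optimal transport cost equals the integral of the absolute difference of the two cumulative distribution functions. The sketch below uses that standard identity as a shortcut; it is not a procedure taken from the paper, and the function name is hypothetical.

```python
def transport_distance_1d(w, alpha, w2, alpha2):
    """Transportation distance between two spike distributions on the line.

    Sweeps the spike locations left to right while tracking F - G, the
    difference of the two CDFs, and integrates |F - G| between locations.
    """
    events = sorted([(a, m) for a, m in zip(alpha, w)] +
                    [(a, -m) for a, m in zip(alpha2, w2)])
    cost, gap, prev = 0.0, 0.0, None
    for point, signed_mass in events:
        if prev is not None:
            cost += abs(gap) * (point - prev)
        gap += signed_mass
        prev = point
    return cost

# Splitting a unit mass at 0.5 into two halves at 0 and 1 moves each half by 0.5.
d_split = transport_distance_1d([0.5, 0.5], [0.0, 1.0], [1.0], [0.5])
d_shift = transport_distance_1d([1.0], [0.0], [1.0], [1.0])
```

For these inputs `d_split` is 0.5 and `d_shift` is 1.0, matching the LP optimum in each case.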

2.3 Perturbation results and operator norm of random matrices 

Definition 2.1. The operator norm of A (induced by the $\ell_2$ norm) is defined by $\|A\|_{op} = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2}$.
The Frobenius norm of $A = (A_{ij})$ is defined by $\|A\|_F = \sqrt{\sum_{i,j} A_{ij}^2}$.

Lemma 2.2 (Weyl; see Theorem 4.3.1 in [27]). Let A and B be n × n matrices such that $\|A - B\|_{op} \le \rho$.
Let $\lambda_1(A) \ge \cdots \ge \lambda_n(A)$ and $\lambda_1(B) \ge \cdots \ge \lambda_n(B)$ be the sorted lists of eigenvalues of A and B
respectively. Then $|\lambda_i(A) - \lambda_i(B)| \le \rho$ for all i = 1, . . . , n.

Lemma 2.3. Let A, B be n × n positive semi-definite (PSD) matrices whose nonzero eigenvalues
are at least $\epsilon > 0$. Let $\Pi_A$ and $\Pi_B$ be the projection operators onto the subspaces spanned by
the eigenvectors of A and B respectively having nonzero eigenvalues. Let $\|A - B\|_{op} \le \rho$. Then
$\|\Pi_A - \Pi_B\|_{op} \le \sqrt{4\rho/\epsilon}$.

Proof. Note that $A\Pi_A = A$, $\Pi_A^2 = \Pi_A$, $B\Pi_B = B$, and $\Pi_B^2 = \Pi_B$. Let x be a unit vector. Then
$\|(A - B)x\| \le \rho$ and, since $\Pi_B$ is a contraction, $\|(A - B)\Pi_B x\| \le \rho\|\Pi_B x\| \le \rho$. Now note that
$(A - B)\Pi_B x = A\Pi_B x - Bx$, so by the triangle inequality, we have $\|A\Pi_B x - Ax\| \le 2\rho$. Now we
can also write $A\Pi_B x - Ax = A(\Pi_B - \Pi_A)x = A(\Pi_A\Pi_B - \Pi_A)x$. Since A here is acting on a vector
that has already been projected down by $\Pi_A$, we can conclude

$$2\rho \ge \|A\Pi_B x - Ax\| = \|A(\Pi_A\Pi_B - \Pi_A)x\| \ge \epsilon\,\|(\Pi_A\Pi_B - \Pi_A)x\|.$$

Thus, $2\rho/\epsilon \ge \|(\Pi_A - \Pi_A\Pi_B)x\|$. By the symmetric argument we can also write
$2\rho/\epsilon \ge \|(\Pi_B - \Pi_B\Pi_A)x\|$. Adding these and applying the triangle inequality, we have

$$4\rho/\epsilon \ge \|(\Pi_A - \Pi_A\Pi_B + \Pi_B - \Pi_B\Pi_A)x\| = \|(\Pi_A^2 - \Pi_A\Pi_B - \Pi_B\Pi_A + \Pi_B^2)x\| = \|(\Pi_A - \Pi_B)^2 x\|.$$

Since $\Pi_A - \Pi_B$ is symmetric, $\|\Pi_A - \Pi_B\|_{op}^2 = \|(\Pi_A - \Pi_B)^2\|_{op} \le 4\rho/\epsilon$. ■

Theorem 2.4 ([40]). For every $\mu > 0$, there is a constant $\kappa = \kappa(\mu) = O(\mu) > 0$ such that
the following holds. Let $X_{ij}$, $1 \le i \le j \le n$, be independent random variables with $|X_{ij}| \le K$,
$E[X_{ij}] = 0$, and $\mathrm{Var}(X_{ij}) \le \sigma^2$ for all $i, j \in [n]$, where $\sigma \ge \kappa^2 n^{-1/2} K \ln^2 n$. Let A be the
symmetric matrix with entries $A_{ij} = X_{\min(i,j),\max(i,j)}$ for all $i, j \in [n]$. Then,

$$\Pr\big[\|A\|_{op} \le 2\sigma\sqrt{n} + \kappa (K\sigma)^{1/2} n^{1/4} \ln n\big] \ge 1 - n^{-\mu}.$$



3 Our algorithm 

We now describe our algorithm that uses 1-, 2-, and (2k − 1)-snapshots from the mixture source
(w, P). Given a matrix Z, we use Span(Z) to denote the column space of Z. Let $r = \sum_{t=1}^{k} w_t p^t$
denote the 1-snapshot distribution of (w, P). Let M be the n × n symmetric matrix representing the
2-snapshot distribution of (w, P); so $M_{ij}$ is the probability of obtaining the 2-snapshot $(i, j) \in [n]^2$.
Let $R = rr^\top$.

Proposition 3.1. $M = \sum_{t=1}^{k} w_t p^t (p^t)^\top = R + A$, where $A = \sum_{t=1}^{k} w_t (p^t - r)(p^t - r)^\top$.
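
The proposition follows from a short expansion, spelled out here for convenience. The first identity for M holds because the two items of a 2-snapshot are independent given the chosen constituent, so $M_{ij} = \sum_t w_t p^t_i p^t_j$. The second uses only $\sum_t w_t p^t = r$ and $\sum_t w_t = 1$:

```latex
\begin{align*}
A &= \sum_{t=1}^{k} w_t (p^t - r)(p^t - r)^\top \\
  &= \sum_{t=1}^{k} w_t\, p^t (p^t)^\top
     - \Big(\sum_{t=1}^{k} w_t p^t\Big) r^\top
     - r \Big(\sum_{t=1}^{k} w_t p^t\Big)^\top
     + \Big(\sum_{t=1}^{k} w_t\Big) r r^\top \\
  &= M - r r^\top - r r^\top + r r^\top \;=\; M - R .
\end{align*}
```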

Note that M and A are both PSD. We say that [w, P) is Q-wide if (i) ||p — g||2 ^ for any two 

distinct p,q G P; and (ii) the smallest non-zero eigenvalue of A is at least (Note that (i) holds 
if $\min_{p,q \in P,\, p \ne q} \|p - q\|_1 \ge \zeta$.) We assume that $w_{\min} := \min_t w_t > 0$. Let $k' = \operatorname{rank}(A) \le k - 1$. It
is easy to estimate r using Chernoff bounds (see, e.g., pj). 

Lemma 3.2. For every p G N and every a > 0, if we use N > ^^^J^-nlnn independent!- snapshots 
and set fi to be the frequency of i in these 1-snapshots for all i G [n], then with probability at least 
1 — the following hold. 

(1 — a)ri < "Tj < (1 + a)ri Mi with Vi > < (1 + a)a /2n Vi with Vi < (2) 

It will be convenient in the sequel to assume that our mixture source {w, P) is isotropic, by 
which we mean that tj- < ri < - for all i G fnl; notice that this implies that < — - — for all 
i G [n]. We show below that this can be assumed at the expense of a small additive error. 

Lemma 3.3. Given an estimate $\tilde r$ satisfying (2), if we can learn the constituents of an isotropic
k-mixture source to within transportation distance $\epsilon$, then we can learn the constituents of the original
k-mixture source to within transportation distance $\epsilon + 4\sigma$.

Proof. Given an arbitrary mixture source for which we have computed the estimate $\tilde r$, consider the
following modification of the mixture constituents. Let $\sigma < 1/32$. We eliminate items i such that
$\tilde r_i < 2\sigma/n$. Each remaining item i is "split" into $\lfloor n\tilde r_i/\sigma \rfloor$ items, and the probability of i is split equally
among its copies. We can sample m-snapshots from the modified mixture source as follows. We
eliminate snapshots that include an eliminated item. With probability at least 1 — n~^, we have 
rj < ^ if i is eliminated, and §5 < ^ < H otherwise. So the total weight of eliminated items is at
most 4(7 and the probability that an m-snapshot survives is at least (1 — 4a)"^; we can take u <C 
(Recall that we are aiming for m ≤ 2k − 1.) In the surviving snapshots, each item i in the original
snapshot is replaced by one of its $\lfloor n\tilde r_i/\sigma \rfloor$ copies, chosen uniformly at random (and independently
of previous such choices).

Abusing notation, let be the modified mixture constituents, and r denote the 

distribution of the modified 1-snapshots. But $\tilde r$ still refers to the original mixture.

It is easy to show that the modified mixture source is isotropic. The number n' of new items is at 
most ^ and at least E.n>2<x/n | • ^ > § • ^ (1 " 2a) > Let = E.n>2./n n>l-4a> 7/8. 
With probability at least 1 — n~^, for every new item ii obtained by splitting item i, we have 
r., > ^ > ii > 2^ and r,, < i • I • ^ < il^ < I,. Note that letting f,, = fj[nr,/a\, 
we have that (1 — cr)ri^ < fi^ < (1 + o')^^^; thus, we immediately have a good estimate of r for the 
modified mixture source. 

If we learn the modified mixture source, we can revert to the original mixture source by
aggregating for each constituent of the mixture the probabilities of the items that we split, and setting
the probability of eliminated items to 0. This degrades the quality of the solution by the weight of
the eliminated items, which is at most an additive $4\sigma$ term in the transportation distance. ■
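
The item-splitting reduction in this proof can be sketched as follows. The sketch assumes an elimination threshold of $2\sigma/n$ and $\lfloor n\tilde r_i/\sigma \rfloor$ copies per surviving item, as read from the proof; the function names, the 0-based item indices, and the numbers are hypothetical.

```python
import math
import random

def split_items(r_est, sigma):
    """Map each original item to the list of its copies in the modified source.

    Items with r_est[i] < 2*sigma/n are eliminated (mapped to an empty list);
    every other item i receives floor(n * r_est[i] / sigma) fresh copies.
    """
    n = len(r_est)
    copies, next_id = {}, 0
    for i, r_i in enumerate(r_est):
        if r_i < 2.0 * sigma / n:
            copies[i] = []
            continue
        count = math.floor(n * r_i / sigma)
        copies[i] = list(range(next_id, next_id + count))
        next_id += count
    return copies

def modify_snapshot(snapshot, copies, rng):
    """Return the modified snapshot, or None if it contains an eliminated item."""
    if any(not copies[i] for i in snapshot):
        return None
    # Each item is replaced by one of its copies, chosen uniformly and independently.
    return tuple(rng.choice(copies[i]) for i in snapshot)

copies = split_items([0.5, 0.29, 0.2, 0.01], sigma=0.03)
rng = random.Random(1)
kept = modify_snapshot((0, 1, 0), copies, rng)
dropped = modify_snapshot((0, 3), copies, rng)
```

In this toy run the rare item 3 is eliminated, items 0, 1, and 2 receive 66, 38, and 26 copies respectively, and each copy carries a roughly equal share of the probability mass, which is the point of the reduction.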



An overview. Our algorithm for learning an isotropic A;-mixture source on [n] takes three pa- 
rameters: C ^ 1 such that (w, P) is ^-wide, w S N, which controls the success probability of the 
algorithm, and 6 G (0,1), which controls the statistical distance between the constituents of the 
learnt model and the constituents of the correct model. For convenience, we assume that 6 is suffi- 
ciently small. The output of the algorithm is a fc-mixture source (w, P) such that with probability 
1 — O(^), \\w — tt)||oo and — for all t S [k] tend to as 5 — )• (see Theorem 14. ip . 

The algorithm (see Algorithm 1) consists of three stages. First, we reduce the dimensionality of
the problem from n to k' using only 1- and 2-snapshots. By Lemma 3.2, we have an estimate $\tilde r$ that
is component-wise close to r. Thus, $\tilde R = \tilde r \tilde r^\top$ is close in operator norm to R. So we focus on learning
the column space of A, for which we employ spectral techniques. Leveraging Theorem 2.4, we argue
(Lemma l4.2p that by using 0{nln^ n) 2-snapshots, one can compute (with high probability) a good 
enough estimate $\tilde M$ of M, and hence obtain a PSD matrix $\tilde A$ such that $\|A - \tilde A\|_{op}$ is small.

The remaining task is to learn the projection of P on the affine space $\tilde r + \mathrm{Span}(\tilde A)$, and the
mixture weights, which then yields the desired k-mixture source $(\tilde w, \tilde P)$. We divide this task into
two steps. We choose a random orthonormal basis $\{b_1, \ldots, b_{k'}\}$ of $\mathrm{Span}(\tilde A)$. For each $b_j$, we consider
the projected k-mixture source $(w, \pi_{b_j}(P))$ on $\mathbb{R}$. One of our contributions is a procedure we devise
in Section 5 to learn the corresponding k-spike distribution $(w, E[\pi_{b_j}(P)])$ using (2k − 1)-snapshots
from $(w, \pi_{b_j}(P))$ (which one can easily obtain using (2k − 1)-snapshots from (w, P)). Applying this
procedure (see Lemma 4.7), we obtain weights $\tilde w^j_1, \ldots, \tilde w^j_k$ and k (distinct) values $\alpha^j_1, \ldots, \alpha^j_k$ such
that each true spike $(w_t, b_j^\top p^t)$ maps to a distinct inferred spike $(\tilde w^j_{t'}, \alpha^j_{t'})$.

The final step is to match up aj and Cfc' for all j = 1, . . . , A;' — 1, and thus obtain k points in f-|- 
Span(j4) that are close to the projection of P on f-|-Span(A). For every j = 1, . . . , k' — l, we generate 
a random unit "test vector" Zj in Span(6j, 5^/) and learn the projections z^jp^, . . . , zjp^. Since {w, P) 
is C-wide, using standard results about random projections, and the guarantees obtained from our 
A:-spike learning procedure, we can argue that z'^-{cx{j)j + a^'^hki) is close to some value in {2:jj'*}te[A;] 
iff there is some t such that (x{^ and a^^' are close respectively to b^-pt and (see Lemma l4.8p . 
Thus, we can use the learned projections of {z^jP^}t(^[k\ to match up {at}ie[fc] and {a^ }tG[fc]- 
Algorithm 1. Input: an isotropic $\zeta$-wide k-mixture source (w, P) on [n], and parameters $\omega > 0$ and $\delta > 0$.
Output: a k-mixture source $(\tilde w, \tilde P)$ on [n] that is "close" to (w, P).

Define T Sujk'^, H = ^2 and L = giZJiTspTH- assume that S < -^a^&lu ■ Let k = k(2 + Inw) 

be given by Theorem \2.4\ ' we assume k > 1 for convenience. Define c — ^^^^2 ■ In-(i)- assume that 



2 > 240k In-' -' n 
"'mill ^ 

Al. Dimension reduction. 

Al.l Use Lemma [3.21 witli /j, = 2 + Inw and a — to compute an estimate f of r. Set R = rr'' . 

A1.2 Independent of all other random variables, choose a Poisson random variable N2 with expectation 
E[iV2] = on In® n. Choose N2 independent 2-snapshots and construct a symmetric n x n matrix 
$\tilde M$ as follows: set $\tilde M_{i,i}$ = frequency of the 2-snapshot (i, i) in the sample for all $i \in [n]$, and

$\tilde M_{i,j} = \tilde M_{j,i}$ = half the combined frequency of 2-snapshots (i, j) and (j, i) in the sample, for all
$i, j \in [n]$, $i \ne j$.

A1.3 Compute the spectral decomposition $\tilde M - \tilde R = \sum_{i=1}^{n} \lambda_i v_i v_i^\top$, where $\lambda_1 \ge \cdots \ge \lambda_n$.
A1.4 Set A = Ei:A,>cV2n ^i^M- Note that A is PSD. 
A2. Learning projections of {w,P) on random vectors in Span(A). 

A2.1 Pick an orthonormal basis B = {bi, . . . , bk'} for Span(A) uniformly at random. 
A2.2 Set (w^ a^) ^ Learn(5j, (5, ^) for all j = 1, . . . , fc'. 

A3. Combining the projections to obtain {w, P). 

A3.1 Pick $\theta \in [0, 2\pi]$ uniformly at random.

A3.2 For each j = 1, . . . , k' − 1, we do the following.

- Set $z_j = b_j \cos\theta + b_{k'} \sin\theta$.

- Set {w^ ,a^) ^ Learn{zj,S,-^). 

- Foreachti,t2 S [fc], if there exists f G [k] such that \{al^bj a^^bk'Y zj — a-j.^ — iV^+l)L/{2-\- 
5T) then set ^-'■(^2) = ti. 

A3.3 Define g''' (t) = t for all t £ [k]. 

A3. 4 For every t e [k]: define wt = {J2j=i ^gj(t))/''''! define p*- = f + {'^^go(t) ~ ^^'^ 

be the point in A"~^ closest in li norm to (which can be computed by solving an LP). Return 

Algorithm Learn(w, e) 

Input: a unit vector v G Span(A), and parameters <j > 0, £ > 0. We assume that (a) \v^{p — q)\ > L ioi all 
distinct p,q G P; and (b) 1024A:(^ < ^^^^fg^. 

Output: a k-spike distribution $(\tilde w, (\gamma_1, \ldots, \gamma_k))$ close to $(w, E[\pi_v(P)])$.
LI. Solve the minimization problem 

minimize ||a;||oo s.t. v^x>l-^, |la;||2 < 1 (Qu) 

which can be formulated as a convex program, to obtain a vector x*; set a = We prove in 

Lemma BTH that ||a||oo < H and |a^(p — q)\ > L/2 for every two mixture constituents p,q £ P. 

L2. Let s = i;^'^. Apply the procedure in Section [5] leading to Theorem 15.11 for (w, 7r„/2ff (-P)) to infer 
a fc-spike distribution (iS,/3) that, with probability at least 1 — e, is within transportation distance 
0(s"(i/fe)) from {w,E 

[T^a/2HiP)])- This uses a sample of (2fc— l)-snapshots of size 3fc2**'^s '^'^ ln(4fc/e). 
L3. For every t g [k], set 74 = [2H f3t){a^ v) . Return (w),7). 
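
Step A1.2 of Algorithm 1 builds a symmetric empirical estimate of the 2-snapshot matrix. The following sketch shows one way to carry out that construction from a given list of 2-snapshots. It assumes that "frequency" means relative frequency in the sample and that items are indexed from 0; it takes the sample as given rather than drawing a Poisson number of snapshots, and the function name is hypothetical.

```python
def empirical_pair_matrix(snapshots, n):
    """Symmetric n x n matrix of 2-snapshot frequencies.

    Diagonal entry (i, i): frequency of the 2-snapshot (i, i).
    Off-diagonal entry (i, j): half the combined frequency of (i, j) and (j, i).
    """
    total = len(snapshots)
    counts = [[0.0] * n for _ in range(n)]
    for i, j in snapshots:
        counts[i][j] += 1.0
    m_est = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                m_est[i][j] = counts[i][i] / total
            else:
                m_est[i][j] = (counts[i][j] + counts[j][i]) / (2.0 * total)
    return m_est

m_est = empirical_pair_matrix([(0, 0), (0, 1), (1, 0), (0, 1)], n=2)
```

Averaging the two orderings keeps the estimate symmetric, so that the spectral decomposition in step A1.3 is well defined, and the entries still sum to 1.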
4 Analysis



We prove the following theorem. 

Theorem 4.1. Algorithm{l\ uses 0(^^ • nlnn) 1-snapshots, 0{^^^—^^^^^ ■ nln^n) 2-snapshots, 

mill 

and 0(^|p- • ln(24a;/i;^)) (2/c — l)-snapshots, and computes a k-mixture source {w^P) on [n] such 
that with probability 1 — 0(;^), there is a permutation a : [k] [k] such that for all t = 1, . . . ,k, 



\wt-Wa{t)\ = 0( a ) and - p'^W ||i = o( 



Hence, Ttan^w, P]!!), P) = O 



k6 



The roadmap of the proof is as follows. By Lemma 13.21 with probability at least 1 



(1 - < < (l + {g)ri for aU i e [n]. We assume that this holds in the sequel. In Lemma [4. 2 ^ 
we prove that the matrix $\tilde A$ computed after step A1 is a good estimate of A. In Lemma 4.3, we
derive some properties of the column space of A. Lemma 4.4 then uses these properties to show
that algorithm Learn returns a good approximation to $(w, E[\pi_v(P)])$. Claim 4.5 and Corollary 4.6
prove that the projections of the mixture constituents on the $b_j$s and the $z_j$s are well-separated.
Combining this with Lemma 4.4, we prove in Lemma 4.7 that with suitably large probability, every
true spike $(w_t, b_j^\top p^t)$ maps to a distinct nearby inferred spike on every $b_j$, $j \in [k']$, and similarly every
true spike $(w_t, z_j^\top p^t)$ maps to a distinct nearby inferred spike on every $z_j$, $j \in [k' - 1]$. Lemma 4.8
shows that one can then match up the spikes on the different $b_j$s. This yields k points in $\mathrm{Span}(\tilde A)$
that are close to the projection of P on $\mathrm{Span}(\tilde A)$. Finally, we argue that this can be mapped to a
k-mixture source $(\tilde w, \tilde P)$ that is close to (w, P).

Lemma 4.2. With probability at least 1 — the matrix A computed after step Al satisfies 
rank(A) = k' = rank(A) and ||^ — ^||op ^ 



5 

op — n ■ 



Proof. Recall that k' = rank(^). Let B = M — R = Ya=i ^i'^^ivj, where Ai > . . . > A„. We prove 
below that with probability at least 1 — we have ||M — M||op < and — i?||op < This 
implies that ||^— i?||op < ||M — M||op + ||i?— i?||op < Hence, by Lemma [2. 2 1 it follows that by the 
(^-wide assumption, X^' > ^ — 2^ > and |Aj| < < for alH > A; . Thus, we include exactly 
k' eigenvectors when defining A, so rank(^) = k'. Since A is the closest rank-A;' approximation in 
operator norm to B, we have ||^ — ^||op < ||^ — -B||op + — ^||op < 2||^ — -B||op < ^• 

We now proceed to bound ||M — M||op and — -R||op. It is easy to see that \Rij — Rij\ < Sarij, 
where a = (5/48, and so \\R - R\\op < \\R - R\\f < ^• 

Bounding 11 M — MlLp is more challenging. Note that Mi < min|-, — - — j| due to isotropy. 

Let K = iiilMi. Let D = N2 ■ {M - M) . Let Xf . = 1 if the i-th snapshot is {i,i), for i G 
[n], and for i,j € [n],i 7^ j, let X^j = = ^ if the i-th 2-snapshot is or {j,i), and 

otherwise. Let Yfj = X^^ - Mij = X^^ - E[X^]; so A,j = ^tj for all i, j G \n\. We have 

o-2(n2) := Var[A,i|A^2 = "2] = n2^Qx{X} \ < n2E[(X/,)^] < n2Mij. For n2 < 2cnln^n, we have 
a'{n2) < < ^^ili^^ (since > ^i^iMi^iSlii). So by Bernstein's inequality,



Scln'^n ^ lnnln(l/<5) (^-^^^ ,„4 ^ 57600k^ In^n ^^ 

K'^ In^ n 



mm 



Pr[|Aj| > i^lnn|A^2 = ^2] < 2exp 



2(cj2(n2) +Klnn/3)^ 
< 2max{exp(-^),exp(-MiM.)} < ^. 

Since Pr[iV2 > 2cln^n] < n ^, we can say that with probabihty at least 1 — 2n ^, we have 
< Klnn for every i,j G [n] and iV2 < 2cln^n. 
Define a matrix $D'$ by putting, for every $i,j \in [n]$, $D'_{ij} = \mathrm{sign}(D_{ij}) \cdot \min\{|D_{ij}|, K\ln n\}$. Put $D'' = D' - \mathrm{E}[D']$. Clearly, $\mathrm{E}[D''_{ij}] = 0$ for every $i,j \in [n]$. The entries of $D$ are independent random variables as $N_2$ is a Poisson random variable; hence, the entries of $D''$ are also independent random variables. Also $\mathrm{Var}[D''_{ij}] \le \mathrm{Var}[D_{ij}]$ since censoring a random variable to an interval can only reduce the variance. Note that $D_{ij} = \sum_{\ell=1}^{N_2} Y^{\ell}_{ij}$ follows the compound Poisson distribution. So we have



Var[Aj] = E[7V2] • E[(y,^,.)'] = E[iV2] • Var[Xi^.] < E[A^2]Mi,,- < 



< 



where c = max| ^ , Thus, by Theorem 12.41 the constant k = k(2 + Inw) > is such that 
with probability at least 1 3- 



,„//,, cK In n ^ \ , cXln n , / r\ -x 

ID op < 2 ^ ^^KOKXnn — Inn < (IKc^kK^Tc) In^n. 

Jn \\ Jn 



We have Pr[iV2 > \ E[iV2]] > 1 - rr'^ , Thus, with probability at least 1 - ^, we have that ISS^ > 
\ E[iV2], B' = D, and ||D"||op is bounded by (j3]). We show below that 2\\E[D']\\op/ EfiVs] < 66n-^ < 
6/20n. One can verify that 4Kc/c < 6/10 and 2KK^/d/c < 6/10. Therefore, with probability at 
least 1 - ^, we have that \\M - M||op = ^ • \\D\\op < ■ {\\D"\\op + \\ E[I?']||op) < i^- 

Finally, we bound || E[L>'] ||op. We have ||E[L>']||op < ||E[L>']||f = \\E[D' - D]\\f < n- 
maxjjE[|D^j — -Djjl]- Let /x = cnln^n = E[A''2]. Fix any For any n2 < 21n(l/5)yLt, we 

have \ai[Di^j\N2 = n2\ < n2Mij < ^^^'^CV'^) in Bernstein's inequality, we have that 



Pr[|Aj| > Klnn|iV2 < 21n(l/5)/i] < 25n-^ Also, iD'^j - Ajl < N2 always. Therefore, 

E[p^,-A.||A^.=n2] if n.<21n(i),; 

I n2 otherwise 

and E[\D'ij - < A* - Pr[iV2 < 2 ln(l/5)^] E[iV2|iV2 < 2ln{l/6)fi]{l - 26n-^). Since N2 is 
Poisson distributed, we have 



Fr[N2<2ln{l/6)f,]E[N2\N2<2ln{l/6)f,] = ^ ^ " = ^ E 

£=0 ' e=o 



L21n{l/5)/.J f, _ L21n(l/<5)MJ-1 (, _y 

IT 

> ^lVT[N2<\n{l/6)^JL] > n{l-6n-^) 



Thus, E[|L>^_^.-A,il] < /i-/i(l-5n-3)(i_25n-3) < 3(^n- V, and 2|| E[L>'] ||op/ E[iV2] <66n-^. ■ 

We assume in the sequel that the high-probability event stated in Lemma 4.2 happens. Thus, 
Lemma \T3\ implies that \\IIa — LI^Ilop < 

Lemma 4.3. For every unit vector $b \in \mathrm{Span}(A)$, $\|b\|_\infty \le$



Proof. Recall that A = X]t=i ~ r){p^ — r)^ , and the smallest non-zero eigenvalue of A is at 
least C'^/n. Note that Span(yl) = Spanjp^ — r, . . . — r}. Let Z = conv(P). As all the mixture 
constituents p ^ P satisfy ||p||oo < „, ^ ^ , for any x G Z, clearly ||x||oo < „, ^ „ Also since x > and 
Iklloo < we have \\x — r\\c^ < — - — . So if r + b ^ Z, then < — - — . Otherwise, let the line 

/||2 ^ C^^iin 



segment [r, r + b] intersect the boundary of Z at some point b' . We show that ||r — b'\\2 > 
The lemma then follows since b = {b' — r)/\\b' — r\\2 and so ||6||oo = < —3-7^- 

II l|2 ^min^v^ 

Let S be a facet of Z such that b' £ S, r ^ S (there must be such a facet since r is in the strict 
interior of P as Wjam > 0). Since Z C Span(A), one can find a unit vector v G Span(A) such that S 
is exactly the set of points that minimize v^x over x € Z. Let dL = f V— min^g^ v^x = v^{r—b'). We 
lower bound ||r — 6'||2 by d^. Note that di > 0. Clearly, v^{p^ — r)> —di for all t G [k]. Projecting 
P onto V, we have that (a) Ylt=i '^t''^^ (P*^ — r) = 0; and (b) v^Av = Ylt=i'^t (^^(p* ~ ^))'^ ^ 
since v G Span(^) and {w,P) is C-wide. Let Wl = Z]t:i,t(pt_r)<o = ^ ~ - ""^mm, and 

let dR = uiaM^Hv' - r)}. Then, = ^Li wtv^p' - r) > W^-dL) + w^^indn, so dR<dL-^. 



Also $ < ELirn ivHp'-r)f < WL-dl + Wn-dl < WL-dl + Wn-dl-^ < So 



mm mm 



dj > ■ 

Lemma 4.4. If the assumptions stated in Algorithm Learn are satisfied, then the vector a computed 
in Learn satisfies ||a||oo < H, and \a\p — q)\ > L/2 for every two mixture constituents p,q ^ P. 
Hence, with probability at least 1 — e, the output (w,^) of Learn satisfies the following: there is a 
permutation cr : [fc] 1— )• [k] such that for all t = 1, . . . ,k, 

, ^/crwi-^fcS , X , , 8V26 2M%kHq L 
\wt-w„it)\ = 0[-^^), Wp -7.wI < — — + . ^ /77 - ^7" + 

'^min'' "^min 'i'min^V'^ "^mm 

Proo/. We have v^YiA{v) = I - \\v - IIa{v)\\1 = 1 - II (n^ - ^a)v\\1 > 1 - • Thus, IIa{v) is 
feasible to ([Q^, and since ||n^(u)||2 < 1, by Lemma lL3l the optimal solution x* to dQ^P satisfies 
||x*||oo < ||nA(i^)||oo < H/2. Also ||x*||| > i;tx* > 1 - J| > |, so ||a||oo < H. Note that 
||f — alll = 2(1 — v'^a) < 2(1 — v^x*) < |f . It follows that for any two mixture constituents p,q, we 
have 

|a'^(p- g)| > \vHp-q)\ - \{v - a)^p-q)\ > \v^p - q)\ ^||p- gib 



I |, 8^25 I |, L L 

> \v\p-q)\ > W{p-q)\ - - >-. 

So any two spikes in the A;-spike mixture {w, E[TTa/2H{P)]) are separated by a distance of at least 
L/AH. Since s < L/4H, Theorem 15.11 guarantees that with a sample of {2k — l)-snapshots of size 
3/c2^'^s~^^ log(2A:/e), with probability at least 1 — e, the learned A;-spike distribution {w,l3) satisfies 
TT:an{w,E[TTa/2H{P)];w,p) < W24ks^/^'^'''^ = 1024/c? < Notice that this implies that there 

is a permutation a : [k] [k] such that = 1, . . . , /c: 

1/ / i r. I 1024^^^ L , , kq \ ^f(;oj^-^k^\ , , 

|W2iJ)V-fl,,,,|<__<-, |„„-«„„|=0(^)=0(^). (3) 

Fix some t G [k]. Let t' = a{t). From ([3]), we know that |aV - 2H ■ I3t'\ = We bound 

|f^p* — a^p*| and \2Hj5ti — 7t'|, which together with the above will complete the proof of the lemma. 
We have |(t; - a)V| < 11?; - a||2||p*||2 < ^4^- Since -ft' = {2HI3t')a'^v and \l3t'\ < h we have 
\2HPt> -jt,\ < ^ < , ^ . It fohows that bV-7t,| < mskHi + ^V^ < mMil+L^ ^ 

Claim 4.5. Let Z be a random unit vector in Span(JL) and v G Span(Ji). Pr[|Z^i;| < 2oL"-^fc4 ] ^ 

Proof. One way of choosing the random unit vector $Z$ is as follows. Fix an orthonormal basis $\{u_1, \dots, u_{k'}\}$ for Span($A$). We choose independent $N(0,1)$ random variables $X_i$ for $i \in [k']$. Define
C = Ya=i ^i^i ^'^d set Z = C/\\C\\2. Set ai = ^"^"^ 02 = 2 + '^'"^^^'^^ ^ ^ < QGw/c. 

Note that C'<v/\\v\\2 is distributed as iV(0, 1). Therefore, Pr[|Ctz;| < ||f||2^] < < 
^r^. Also, IICII2 = Yli=i^i follows the xi' distribution. So 



Pr[||C||2 > a2k'] < {a2e^-''A^'^ < exp((l - 02/2)^72) < 



Observe that -^g ^ ■i2uA- ' 'k'^ ' ^o the "bad" event stated in the lemma happens, then \C'^v\ < 
lbl|2\/oi or IICII2 > a2k' happens; the probability of this is at most ■ ■ 
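The sampling procedure used in the proof of Claim 4.5 (Gaussian coefficients on an orthonormal basis, followed by normalization) can be sketched as follows. This is a minimal illustration only: the function name `random_unit_vector` and the representation of vectors as Python lists are assumptions made here, not part of the paper.

```python
import math
import random

def random_unit_vector(basis, rng=random):
    """Sample a random unit vector in the span of an orthonormal basis.

    `basis` is a list of orthonormal vectors (lists of floats). The name and
    interface are hypothetical; they only mirror the proof of Claim 4.5.
    """
    # Independent N(0, 1) coefficients X_i, one per basis vector u_i.
    coeffs = [rng.gauss(0.0, 1.0) for _ in basis]
    dim = len(basis[0])
    # Gaussian combination sum_i X_i * u_i.
    combo = [sum(c * u[d] for c, u in zip(coeffs, basis)) for d in range(dim)]
    # Normalize to unit Euclidean length.
    norm = math.sqrt(sum(x * x for x in combo))
    return [x / norm for x in combo]
```

Because the Gaussian vector is rotationally invariant within the subspace, the normalized vector is uniformly distributed on the unit sphere of that subspace.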

Lemma 4.6. With probability at least 1 — for every pair p,q £ P, we have (i) \b^j{p — q)\ > L 
for every j G [A;'] and (ii) \z^j{p — q)\ > L for every j G [k' — 1]. 

Proof. Define p = Il^{p) for a mixture constituent p. Clearly, for any v G Span(yl), v^p = v^p. 
Recall that Hn^-n^l < So for every p,qeP, \\p-q\\l > ||p - - IKHa - n^)(p - > 
Hp ~ hence, \\p — q\\2 > 2^' -l^otice that the zj vectors are also random unit vectors in 

Span(j4). Applying Claim H3] to each event involving one of the {&j}je[A;']i {^j}je[k'-i] random unit 
vectors, and one of the (2) vectors \\p — q\\ for p,q £ n^(P), and taking the union bound over the 
at most k'k"^ such events completes the proof. ■ 

Lemma 4.7. With probability at least 1 — the k-spike distributions obtained in steps A2 and 
A3 satisfy: 

(i) For every j G [A;'], there is a permutation : [k] 1— > [k] such that for all t G [A;], 

I ~? I ^/'Su}'^-^k^\ nf f i , ^ \ , . L 

l^t - K^U) \=0( —0—79 ' \Kp - ^i^m \=0( ^ 1 .5 . r- and is at most 



^<inC^^' ' ^'^ '^^W ^<UV^^ ............ 2 + ST- 

Hence, \al^ — a^^l ^ 1+0^4/r f*^^ distinct ti,t2 G [A;]. 

(ii) For every j G [k' — 1], for every t G [A;], there is a distinct t' such that 

I -i, ^f6uj^-^k^\ , i t -i , ^/ ^ \ , • L 

\wt - ur.,\ = O ( ^ — -5- , \z'-p - ai\ = 0[ , . and is at most - — — . 

min^ minS V ^ i 

Proof. Assume that the event stated in Lemma 14.61 happens. Then the inputs to Learn in steps 
A2 and A3 are "valid", i.e., satisfy the assumptions stated in Algorithm Learn. Plug in = 5 and 
e = gji^ in Lemma |4.4[ Taking the union bound over all the bjS and the ZjS, we obtain that the 
probability that Learn fails on some input, when all the bjS and ZjS are valid is at most The 

lemma follows from Lemma 14.41 bv noting that "^^f^^,^^ = Q f ,,,1.5^^ ^ and is at most 2^) and 



L/24T + L/Sr < L/(2 + 5T). 

Lemma 4.8. With probability at least 1 ■ 
and Q^{(J^{t)) = <J^ (t) for every t £ [k]. 



-, for every j = 1, . . . ,k' — 1 is a well-defined function 



Proof. Assume that the events in Lemmas 14.61 and 14.71 occur. Fix j £ [k' — 1]. We call a point 
al^bj + Q^^bk' a grid-j point. Call this grid point "genuine" if there exists t G [k] such that {t) = ti 
and {t) = t2, and "fake" otherwise. The distance between any two grid-j points is at least 
+ 0.4/T) (by Lemma l4.7p . So the probability there is a pair of genuine and fake grid-j points 



U3. 2 



vr 5T 



< 



whose projections on Zj are less than L/{T + 0.4) away is at most k ■ - arcsin(y j < k 
Therefore, with probability at least 1 — cj, the events in Corollarv 14.61 and Lemma 14.71 happen, and 
for all j £ [k' — 1], every pair of genuine and fake grid-j points project to points on zj that are at 
least L/{T + 0.4) apart. We condition on this in the sequel. 

Now fix j G [k' — 1] and consider any pair ti,t2 £ [k]"^. Let g be the grid-j point bj(xj-_^ + bya\^ 
We show that Q^{t2) = ii iff 5 is a genuine grid-j point. If g is genuine, let t be such that 
(T-'(t) = ti, {t) = t2. Let p' be the projection of p* on Span(6j, 6^'). By Lemma [47} we have that 



Hp' - fib < 



V2L 
2+5T- 



Also, there exists t' £ [k] such that |q!^, 



< 



Since zjp' 



implies that \z'jg 



t. _ A.1 I ^ _ I + |^t(y _ ^)| < (xg+lK, SO (^{t2) = tl. 



ZjP*^, this 



Now suppose g is fake but \zjg - a^,| < {V2 + l)L/{2 + 5T) for some t' £ [k]. Let t £ [k] be 



such that \a^, 



■ZjP^l 



— 2+5T - s' genuine grid point bjOp^j,^. (t)' 1"^]^' 



(^/2 + l)L/(2 + 5T), and hence \z]{g - g')\ < 



2(V2+1)L 
2+5T 



< 



0.4+T 



which is a contradiction. 



' a. 



< 



Proof of Theorem \4.1\ We condition on the fact that all the "good" events stated in Lemmas 13.21 
14.21 14-61 14.71 and 14.81 happen. The probability of success is thus 1 — 0(^)- The sample-size bounds 
follow from the description of the algorithm. For notational simplicity, let be the identity 
permutation, i.e., cj'^ (t) = t for all t £ [k]. So by Lemma 14.81 we have 0^{t) = cr^{t) for every 
j £ [k' - 1] and t £ k. 

For t = 1, 2, . . . , A:, define p^ = r + Ylf=i b] (p* 



Fix t£ [k]. Then 



We have 



\\p' 



< 



\\P -P II2 



\P 



■#lli + ll# 



p'Wi <2||p*-j5*||i <2 



< 2 



f + (p* — r 
f + n^(r -r) + {p 

^-f||2 + -||nA-n 



t 



Allop 



r)bj 



\\p -p \\l + \\p 



\\p 



r 2 



< 



12^ 



+ 



■P% 



)• 



8V25 



Also \\p* -p% < \\y2{b]p 



a 



o 



.k6 \ 



where the last equality follows from Lemma 14.71 Thus, \\p^ 



P*lli 



0{^^). Also, we have 



13 



\wt-wt\ = 0{^^^^) by Lemma [im Finally, note that 



k 



I ~ii II t ~t' II 

\w — ■w\\ima.x\\p —p 1 
t,t' 



Tran(i(;, P; w,P) < - mm{wt,Wt} max — + 

t=i 

1 

< - ( minjwt, wt} max — J)*!]! + lltL) — (max — ||i + max — J)*!]! 



2 L .J ^ iir- . ,i. ,1 ii^v 

t=i 

„ ^ , 2 \/M 

<max||p — p'lli + ||tf^ — ^flli • = 



5 The one-dimensional problem: learning mixture sources on [0,1] 

In this section, we supply the key subroutine called upon in step L2 of Algorithm Learn, which will complete the description of Algorithm 1. We are given a $k$-mixture source $(w, \pi_x(P))$ on $[-\frac{1}{2}, \frac{1}{2}]$. (Recall that Learn invokes the procedure for the mixture $(w, \pi_{a/2H}(P))$ where $\|a\|_\infty \le H$.) It is clear that we cannot in general reconstruct this mixture source with an aperture size that is independent of $n$, let alone aperture $2k-1$. However, our goal is somewhat different and more modest. We seek to reconstruct the $k$-spike distribution $(w, \mathrm{E}[\pi_x(P)])$, and we show that this can be achieved with aperture $2k-1$ (which is the smallest aperture at which this is information-theoretically possible).

It is easy to obtain a $(2k-1)$-snapshot from $(w, \pi_x(P))$ given a $(2k-1)$-snapshot from $(w,P)$ by simply replacing each item $i \in [n]$ that appears in the snapshot by $x_i$. We will assume in the sequel that every constituent $\pi_x(p^t)$ is supported on $[0,1]$, which is simply a translation by $\frac{1}{2}$.
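The snapshot conversion described above is a simple relabeling followed by a shift. A minimal sketch, assuming items are numbered 1 through n and that the vector x (for example a/2H) has entries in [-1/2, 1/2]; the function names are hypothetical:

```python
def project_snapshot(snapshot, x):
    """Replace each item i in [n] (1-based) of a snapshot by the value x_i."""
    return [x[i - 1] for i in snapshot]

def shift_to_unit_interval(values):
    """Translate values from [-1/2, 1/2] to [0, 1]."""
    return [v + 0.5 for v in values]

# Example: a 3-snapshot over the domain [3] projected onto the line.
x = [-0.5, 0.0, 0.25]
projected = shift_to_unit_interval(project_snapshot([1, 3, 3], x))
```

The aperture is unchanged by this map: a (2k-1)-snapshot of items becomes a (2k-1)-snapshot of points in [0, 1].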

To simplify notation, we use $\theta = (\vartheta, (q^1, \dots, q^k))$ to denote the $k$-mixture source on $[0,1]$, and $(\vartheta, \alpha = (\alpha_1, \dots, \alpha_k))$ to denote the corresponding $k$-spike distribution, where $\alpha_i \in [0,1]$ is the expectation of $q^i$ for all $i \in [k]$. We equivalently view $(\vartheta, \alpha)$ as a $k$-mixture source $(\vartheta, (f^1, \dots, f^k))$ on $\{0,1\}$: each $f^i$ is a "coin" whose bias is $f^i_1 = \alpha_i$. In Section 5.1, we describe how to learn such a binary mixture source from its $(2k-1)$-snapshots (see Algorithm 2 and Theorem 5.3). Thus, if we can obtain $(2k-1)$-snapshots from the binary source $(\vartheta, (f^1, \dots, f^k))$ (although our input is $\theta$), then Theorem 5.3 would yield the desired result. We show that this is indeed possible, and hence obtain the following result (whose proof appears at the end of this section).

Theorem 5.1. Let $\theta = (\vartheta, (q^1, \dots, q^k))$ be a $k$-mixture source on $[0,1]$, and $(\vartheta, \alpha)$ be the corresponding $k$-spike distribution. Let $\tau = \min_{j \ne j'} |\alpha_j - \alpha_{j'}|$. For any $s < \tau$ and $\psi > 0$, using
3k2'^^s~^^ln{4k/ip) {2k — l)-snapshots from source 0, one can compute in polytime a k-spike dis- 
tribution ("!?, a) on [0, 1] such that Tran(??, a; ??, a) < 1024A;s-'^/(^'^) with probability at least 1 — i/j. 



5.1 Learning a binary k-mixture source

Recall that $(\vartheta, (f^1, \dots, f^k))$ denotes the binary $k$-mixture source, and $\alpha_i = f^i_1$ is the bias of the $i$-th "coin". We can collect from each $(2k-1)$-snapshot a random variable $0 \le X \le 2k-1$ denoting the number of times the outcome "1" occurs in the snapshot. Thus,

$$\Pr[X = i] = \binom{2k-1}{i} \sum_{j=1}^{k} \vartheta_j\, \alpha_j^{i} (1-\alpha_j)^{2k-1-i}. \tag{4}$$
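Equation (4) says that the number of ones in a snapshot follows a mixture of binomial distributions. The following sketch evaluates it directly; the function name `ones_count_distribution` is a hypothetical helper introduced for illustration.

```python
from math import comb

def ones_count_distribution(weights, biases):
    """Return [Pr[X = i] for i = 0..2k-1] for a binary k-mixture source.

    weights[j] is the mixture weight of coin j and biases[j] its bias;
    each snapshot consists of 2k - 1 tosses of one randomly chosen coin.
    """
    k = len(weights)
    m = 2 * k - 1  # aperture
    return [
        comb(m, i) * sum(w * a ** i * (1 - a) ** (m - i)
                         for w, a in zip(weights, biases))
        for i in range(m + 1)
    ]
```

The returned values sum to 1, since each mixture component contributes a full binomial distribution scaled by its weight.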

Our objective is to use these statistics to reconstruct, in transportation distance (see Section 2.2), the binary source (i.e., the mixture weights and the $k$ biases). Now consider the equivalent $k$-spike distribution $(\vartheta, \alpha)$. The $i$-th moment, and (what we call) the $i$-th normalized binomial moment (NBM) of this distribution are

$$g_i(\vartheta,\alpha) = \sum_{j=1}^{k} \vartheta_j \alpha_j^{i}, \qquad \nu_i(\vartheta,\alpha) = \sum_{j=1}^{k} \vartheta_j \alpha_j^{i}(1-\alpha_j)^{2k-1-i}.$$

Up to the factors $\binom{2k-1}{i}$ the NBMs are precisely the statistics of the random variable $X$ (Eqn. (4)), and so our objective in this section can be restated as: use the empirical NBMs to reconstruct the $k$-spike distribution $(\vartheta, \alpha)$.

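Let $g(\vartheta,\alpha) = (g_i(\vartheta,\alpha))_{i=0}^{2k-1}$ and $\nu(\vartheta,\alpha) = (\nu_i(\vartheta,\alpha))_{i=0}^{2k-1}$ denote the vector of the first $2k-1$ moments and NBMs respectively of $(\vartheta,\alpha)$. For a positive integer $b$, and a vector $\beta = (\beta_1, \dots, \beta_\ell)$, let $A_b(\beta)$ be the $\ell \times b$ matrix $(A_b(\beta))_{ij} = (1-\beta_i)^{b-1-j}\beta_i^{j}$ (with $1 \le i \le \ell$ and $0 \le j \le b-1$). Analogously, let $V_b(\beta)$ be the $\ell \times b$ matrix $(V_b(\beta))_{ij} = \beta_i^{j}$ (with $1 \le i \le \ell$ and $0 \le j \le b-1$). Let Pas be the $2k \times 2k$ lower triangular "Pascal triangle" matrix: for $0 \le j \le 2k-1$ and $j+1 \le i \le 2k$, $\mathrm{Pas}_{ij} = \binom{2k-1-j}{i-1-j}$. Then $V_{2k}(\alpha) = A_{2k}(\alpha)\mathrm{Pas}$, $\nu(\vartheta,\alpha) = \vartheta A_{2k}(\alpha)$, and $g(\vartheta,\alpha) = \vartheta V_{2k}(\alpha) = \nu(\vartheta,\alpha)\mathrm{Pas}$.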
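The relation between NBMs and ordinary moments can be checked numerically. The sketch below assumes the Pascal matrix entries follow from the binomial identity $\alpha^j = \sum_{l \ge j} \binom{2k-1-j}{l-j}\alpha^l(1-\alpha)^{2k-1-l}$, with both rows and columns indexed from 0 in the code; all function names are hypothetical.

```python
from math import comb

def moment_vector(weights, alphas):
    """Ordinary moments g_i = sum_j w_j * a_j**i for i = 0..2k-1."""
    m = 2 * len(weights) - 1
    return [sum(w * a ** i for w, a in zip(weights, alphas)) for i in range(m + 1)]

def nbm_vector(weights, alphas):
    """Normalized binomial moments nu_i = sum_j w_j * a_j**i * (1-a_j)**(2k-1-i)."""
    m = 2 * len(weights) - 1
    return [sum(w * a ** i * (1 - a) ** (m - i) for w, a in zip(weights, alphas))
            for i in range(m + 1)]

def pascal_matrix(k):
    """Lower triangular matrix with entry (l, j) = C(2k-1-j, l-j), 0-based."""
    m = 2 * k - 1
    return [[comb(m - j, l - j) if l >= j else 0 for j in range(m + 1)]
            for l in range(m + 1)]

def nbm_to_moments(nu, k):
    """Row vector times matrix: g = nu * Pas."""
    pas = pascal_matrix(k)
    size = len(nu)
    return [sum(nu[l] * pas[l][j] for l in range(size)) for j in range(size)]
```

For example, `nbm_to_moments(nbm_vector(w, a), len(w))` should agree with `moment_vector(w, a)` up to floating-point error, which is the identity that lets the algorithm work with ordinary moments even though the samples yield NBMs.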

In our algorithm it is convenient to use the empirical ordinary moments, but what we obtain 
are actually the empirical NBM's, so we need the following lemma. 

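Lemma 5.2. $\|\mathrm{Pas}\|_{op} \le 4^k/\sqrt{3}$.

Proof. $\|\mathrm{Pas}\|_{op}^2 \le \|\mathrm{Pas}\|_F^2 = \sum_{m=0}^{2k-1}\sum_{i=0}^{m}\binom{m}{i}^2$. Since $\sum_{i=0}^{m}\binom{m}{i}^2 = \binom{2m}{m} \le 2^{2m}$, we get $\|\mathrm{Pas}\|_F^2 \le \sum_{m=0}^{2k-1} 2^{2m} < 4^{2k}/3$. ■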

Our algorithm takes two parameters $\tau$ and $\xi$ as input, along with the empirical NBM vector $\tilde\nu$ (or equivalently $\tilde g$). Since we infer (in the sampling limit) the locations of the $k$ spikes exactly, there is a singularity in the process when spikes coincide. So we assume a minimum separation between spikes: $\tau = \min_{j \ne j'} |\alpha_j - \alpha_{j'}|$. (It is of course possible to simply run a doubling search for sufficiently small $\tau$, but the required accuracy in the moments, and hence sample size, does increase as $\tau$ decreases.) We also assume a bound $\xi$ on the accuracy of our empirical statistics. (When we utilize Theorem 5.3 to obtain Theorem 5.1, $\xi$ is a consequence, and not an input parameter.) We require that

||z>-z.(7?,a)||2 <e4-'=V3, C<r^'' (5) 

Theorem 5.3. There is a polytime algorithm that receives as input r, i^, an empirical NBM vector 
~ g j^2fc gQfigjyifig outputs a k-spike distribution {-d, a) on [0, 1] such that Tran(i?, a; d) < 

We first show the information-theoretic feasibility of Theorem 5.3: the transportation distance between two probability measures on $[0,1]$ is upper bounded by (a moderately-growing function of) the Euclidean distance between their moment maps. (To use Lemma 5.4 to prove Theorem 5.3, we have to show how to compute $\tilde\vartheta$ and $\tilde\alpha$ from $\tilde g$ such that $\|\tilde g - g(\tilde\vartheta, \tilde\alpha)\|_2$, and hence $\|g(\vartheta,\alpha) - g(\tilde\vartheta,\tilde\alpha)\|_2$, is small.)

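Lemma 5.4. For any two (at most) $k$-spike distributions $(\vartheta,\alpha)$, $(\tilde\vartheta,\tilde\alpha)$ on $[0,1]$,

$$\|g(\vartheta,\alpha) - g(\tilde\vartheta,\tilde\alpha)\|_2 \;\ge\; \frac{1}{(2k-1)^{4k}\,2^{8k-5}} \cdot \bigl(\mathrm{Tran}(\vartheta,\alpha;\tilde\vartheta,\tilde\alpha)\bigr)^{4k-2}.$$
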

Lemma 5.4 can be geometrically interpreted as follows. The point $g(\vartheta,\alpha)$ is in the convex hull of the moment curve and is therefore, by Carathéodory's theorem, expressible as a convex combination of $2k$ points on the curve. However, this point is special in that it belongs to the collection of points expressible as a convex combination of merely $k$ points of the curve. Lemma 5.4 shows that $g(\vartheta,\alpha)$ is in fact uniquely expressible in this way, and that moreover this combination is stable: any nearby point in this collection can only be expressed as a very similar convex combination. We utilize the following lemma, which can be understood as a global curvature property of the moment curve; we defer its proof to Section 5.2. The moment curve plays a central role in convex and polyhedral geometry [8], but as far as we know Lemmas 5.4 and 5.5 are new, and may be of independent interest.
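Lemma 5.4 is phrased in terms of the transportation distance between spike distributions on $[0,1]$. The formal definition is in Section 2.2, which is not reproduced here; the sketch below assumes it coincides with the usual earthmover (Wasserstein-1) distance on the line, which for two distributions equals the integral of the absolute difference of their cumulative distribution functions. The function name is hypothetical.

```python
def transportation_distance_1d(w1, a1, w2, a2):
    """Earthmover distance between spike distributions (w1, a1) and (w2, a2).

    Assumes both weight vectors sum to 1 and spikes lie on the real line.
    Computes the integral of |F1 - F2| by sweeping over all spike locations.
    """
    # Signed masses: positive for the first distribution, negative for the second.
    events = sorted([(a, w) for a, w in zip(a1, w1)] +
                    [(a, -w) for a, w in zip(a2, w2)])
    total = 0.0
    cumulative = 0.0  # F1 - F2 just to the right of the previous location
    previous = None
    for position, mass in events:
        if previous is not None:
            total += abs(cumulative) * (position - previous)
        cumulative += mass
        previous = position
    return total
```

This sweep is the same quantity that appears in the proof of Lemma 5.4, where the distance is split into contributions from the gaps between consecutive points of the merged support.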

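Lemma 5.5. Let $0 \le \beta_1 < \dots < \beta_{k+1} \le 1$, $\ell \in [k]$, and $s = \beta_{\ell+1} - \beta_\ell$. Let $\gamma(x) = \sum_{i=0}^{k} \gamma_i x^i$ be the real polynomial of degree $k$ evaluating to 1 at the points $\beta_1, \dots, \beta_\ell$ and evaluating to 0 at the points $\beta_{\ell+1}$,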
. . . ,/3,+i. Then EtoT? < ^^2^'^-^'^'^ . 

Proof of Lemma \5^[ Denote {ai, . . . , at} U {ai, . . . , dtk] by a = {ai, . . . , a/^} where «!<...< 
oiK. Define '^i = Ei:a,=s. " Ei:a,=a, "^3 ie[K]. Let ^ G be the row vector (^i, . . . .Ik)- 
Let ?7 = Tran('i?, a; ■(?, a). So we need to show that ||'!9V2fc(a)||2 > (2fc-i)^'^2»'°-5 ' suffices to 

show that \\dVK{a)h > (/^-i)/i^2^^-^ ' "l^^'^' 

There is an 1 < £ < K such that jX^^^i • (a^+i — a^) > r]/ {K — 1). Let 6 = Yli=i '^i'l without 
loss of generality 5 > 0, and note that 5 <1. Let s = a^+i — «£, so {K — l)5s > rj. 

Denote row i of a matrix Z by Zi^: and column j by Z^:j. We lower bound ||'(?Vft'(«)||2j by 
considering its minimum value under the constraints X^j^x "di = 5 and Xli^i '^i ~ 0- 

A vector y'^ = -dVKijy) minimizing \\y\\2 must be orthogonal to Vft-(a)i* — Vxictji'* if 1 < i < 
i' < £ or SI £ + 1 < i < i' < K . This means that there are scalars c and d such that Vft:(a)y = 
c{'}2^j=i Cj) + d(Yl,f=i+i where vector Cj G M.^ has a 1 in the j-th position and everywhere 

else. Therefore, y = cy + dy' , where 7 = Yl!'j=i{^K{p)~^)*j and 7' = Yl,f=i+ii^K(^)~^)*j- At the 
same time 

I K 

5 = ^'di = ??V/^(a)7 = y"^7 = c||7||2+d7'"^7 -5= ^ = ??VK(a)7' = y^' = c7"^7'+d||7'||2 
i=l i=e+i 

and hence, ||y||| = y • (07 + d'j') = (c — d)6. Solving for c, d, 

6\h + i\\l 



d 



i7iii-iiyiii-(7t-y)^ 



First we examine the numerator. Like any combination of the columns of Vxia)^^, 7 + 7' is the 
list of coefficients of a polynomial of degree K — 1, in the basis 1, x, . . . , x^^^. By definition, 
7 + 7' = which is to say that for every i, Vft-(a)» • (7 + 7') = 1- So the polynomial 

7 + 7' evaluates to 1 at every a^. It can therefore only be the constant polynomial 1; this means 
that (7 + 7')i = 1 if i = 0, and (7 + y')i = otherwise. Thus II7 + 7'||2 = 1. 

Next we examine the denominator, which we upper bound by II7II2 • ||7'||2- When interpreted as 
a polynomial, 7 takes the value 1 on a nonempty set of points ai, . . . ,«£ separated by the positive 
distance s = a^+i — ai from another nonempty set of points a^+i, . . . ,aK upon which it takes the 
value 0. Observe that if the polynomial was required to change value by a large amount within a 
short interval, it would have to have large coefficients. A converse to this is the inequality stated in Lemma 5.5. Using this to bound $\|\gamma\|_2$ and $\|\gamma'\|_2$, and since $\delta s \ge \eta/(K-1)$, we obtain that

2_. _ r?^^-^ 

I|y||2-(C d}d> ||^||2. ||^,||2 > ((^ _ i)224^-5s-2^+2)2 ^ - 1)'^^ 2^^ ' 

We now define the algorithm promised by Theorem 5.3 and then prove the theorem. To give some intuition, suppose first that we are given the true moment vector $g(\vartheta,\alpha) = \vartheta V_{2k}(\alpha)$. Observe that there is a common vector $\lambda = (\lambda_0, \dots, \lambda_k)^\dagger$ of length $k+1$ that is a dependency among every $k+1$ adjacent columns of $V_{2k}(\alpha)$. In other words, letting $\Lambda = \Lambda(\lambda)$ denote the $2k \times k$ matrix with $\Lambda_{ij} = \lambda_{i-j}$ (for $0 \le i < 2k$, $0 \le j < k$ and with the understanding $\lambda_\ell = 0$ for $\ell \notin \{0, \dots, k\}$), $V_{2k}(\alpha)\Lambda = 0$. Thus $g(\vartheta,\alpha)\Lambda = \vartheta V_{2k}(\alpha)\Lambda = 0$. Overtly this is a system of $2k$ equations to determine $\lambda$. But we eliminate the redundancy in $\Lambda$ by forming the $k \times (k+1)$ matrix $G = G(g(\vartheta,\alpha))$ defined by $G_{ij} = g(\vartheta,\alpha)_{i+j}$, for $i = 0, \dots, k-1$ and $j = 0, \dots, k$, and then solve the system of linear equations $G\lambda = 0$ to obtain $\lambda$. This system does not have a unique solution, so in the sequel $\lambda$ will denote a solution with $\lambda_k = 1$. For each $i = 1, \dots, k$, we have $(V_{2k}(\alpha)\Lambda)_{i,0} = \sum_{\ell=0}^{k} \lambda_\ell \alpha_i^{\ell} = 0$. This implies that we can obtain the $\alpha_i$ values by computing the roots of the polynomial $P_\lambda(x) := \sum_{\ell=0}^{k} \lambda_\ell x^{\ell}$. Once we have the $\alpha_i$'s, we can compute $\vartheta$ by solving for $y$ the system of linear equations $yV_{2k}(\alpha) = g(\vartheta,\alpha)$.
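The noiseless procedure just described can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's Algorithm 2: it uses exact moments, plain Gaussian elimination, a Durand-Kerner iteration as a stand-in root finder, and it recovers the weights from only the first $k$ of the $2k$ moment equations (which is consistent when the moments are exact). All function names are hypothetical.

```python
def solve_linear(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def monic_roots(coeffs, iterations=200):
    """Durand-Kerner iteration; coeffs[l] multiplies x**l and coeffs[-1] == 1."""
    k = len(coeffs) - 1
    roots = [(0.4 + 0.9j) ** i for i in range(k)]  # distinct starting points
    for _ in range(iterations):
        updated = []
        for i, z in enumerate(roots):
            value = sum(c * z ** l for l, c in enumerate(coeffs))
            denom = 1.0
            for j, other in enumerate(roots):
                if j != i:
                    denom *= z - other
            updated.append(z - value / denom)
        roots = updated
    return roots

def recover_spikes(g, k):
    """Recover (weights, locations) of a k-spike distribution from exact moments g."""
    # Hankel matrix G with G[i][j] = g[i + j]; solve G * lam = 0 with lam[k] = 1.
    hankel = [[g[i + j] for j in range(k + 1)] for i in range(k)]
    lam = solve_linear([row[:k] for row in hankel], [-row[k] for row in hankel]) + [1.0]
    alphas = sorted(z.real for z in monic_roots(lam))
    # Weights from the first k moment equations: sum_i y_i * alpha_i**j = g[j].
    vandermonde_t = [[a ** j for a in alphas] for j in range(k)]
    weights = solve_linear(vandermonde_t, g[:k])
    return weights, alphas
```

With noisy moments this direct approach is unstable, which is why the algorithm that follows replaces the exact linear solve with a constrained minimization and rounds the roots.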

Of course, we are actually given $\tilde g$ rather than the true vector $g(\vartheta,\alpha)$. So we need to control the error in estimating first $\alpha$ and then $\vartheta$. The learning algorithm is as follows.

Algorithm 2. Input: parameters $\xi$, $\tau$, and empirical moments $\tilde g$ such that $\|\tilde g - g(\vartheta,\alpha)\|_2 \le \xi$.
Output: a $k$-spike distribution $(\tilde\vartheta, \tilde\alpha)$.

B1. Solve the minimization problem:

minimize $\|x\|_1$ s.t. $\|G(\tilde g)x\|_1 \le 2^k k \xi$, $x_k = 1$ (P)

which can be encoded as a linear program, to obtain a solution $\tilde\lambda$. Observe that since $G(\tilde g)$ has $k+1$ columns and $k$ rows, there is always a feasible solution.

B2. Let $\hat\alpha_1, \dots, \hat\alpha_k$ be the (possibly complex) roots of the polynomial $P_{\tilde\lambda}$. Thus, we have $V_{2k}(\hat\alpha)\Lambda(\tilde\lambda) = 0$.
We map the roots to values in [0, 1] as follows. Let e — ^{2k£^)^/^. First we compute values di, . . . ,afc 
such that $|\bar\alpha_i - \hat\alpha_i| \le \epsilon$ for every $i$, in time $\mathrm{poly}(\log(1/\epsilon))$, using Pan's algorithm [33, Theorem 1.1]. We now set $\tilde\alpha_i = \max\{0, \min\{\mathrm{Re}(\bar\alpha_i), 1\}\}$.

B3. Finally, we set $\tilde\vartheta$ to be the row-vector $y$ that minimizes $\|yV_{2k}(\tilde\alpha) - \tilde g\|_2$ subject to $\|y\|_1 = 1$, $y \ge 0$. Note that this is a convex program.
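The rounding at the end of step B2 takes the real part of each approximate root and clamps it to $[0,1]$. A minimal sketch, with a hypothetical function name and with the roots assumed to be supplied by some external root finder:

```python
def round_roots_to_unit_interval(roots):
    """Map (possibly complex) approximate roots to spike locations in [0, 1].

    Each root is replaced by max(0, min(Re(root), 1)).
    """
    return [max(0.0, min(complex(z).real, 1.0)) for z in roots]

# Example: one slightly complex root, one negative root, one root above 1.
locations = round_roots_to_unit_interval([0.3 + 0.1j, -0.2, 1.5 - 0.4j])
```

Since the true spike locations are real and lie in $[0,1]$, this projection never moves an approximate root farther from a true location than it already was.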

We now analyze Algorithm 2 and justify Theorem 5.3. Recall that $\tau = \min_{j \ne j'} |\alpha_j - \alpha_{j'}|$. We need the following lemma, whose proof appears in Section 5.3.



Lemma 5.6. The weights d satisfy \\W2k{a) - cjh < \\g{'&,a) - gh + • (2/c^)i/'=. 

Proof of Theorem \5.3[ We call Algorithm [2] with g = PPas. By Lemma 15.21 we obtain that \\g - 
^(t?, q)||2 < and by Lemma [5.61 we have that ||(7(T?,a) — i?V2fc(a)||2 < 2||5f(i?, q) — g\\2 + ^ 

(8A;)'^/^(2A;,^)^'''^. Coupled with Lemma 15.41 and since ^ < r^'^, we obtain that 



Tran(i?, a; 1?, q) < {2k - l)^''2^''-^\\g{'d,a) - g{^,a)\\2 
{2k - 1)4^28^-5 (2e + {2k^) 



1 

4fe-2 



< 



r 



1 

4fc-2 



< 



1 

4fc-2 



< 1024 • /c^i 



^The theorem requires that the complex roots lie within the unit circle and that the coefficient of the highest-degree term is 1; but the discussion following it in [35] shows that this is essentially without loss of generality.

Proof of Theorem 5.1. We convert $\theta$ to the corresponding binary source $(\vartheta, (f^1, \dots, f^k))$ by randomized rounding. Given a $(2k-1)$-snapshot $z = (z_1, \dots, z_{2k-1}) \in [0,1]^{2k-1}$ from $\theta$, we obtain a $(2k-1)$-snapshot from the binary source as follows. We choose $2k-1$ independent values $a_1, \dots, a_{2k-1}$ uniformly at random from $[0,1]$ and set $X_i = 1$ if $z_i \ge a_i$ and $X_i = 0$ otherwise, for all $i \in [2k-1]$. Note that if $q^j$ is the constituent generating the $(2k-1)$-snapshot $z$, then $\Pr[X_i = 1 \mid q^j] = \mathrm{E}[X_i \mid q^j] = \alpha_j$, and so $X_1, \dots, X_{2k-1}$ is a random $(2k-1)$-snapshot from the above binary source.
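The randomized rounding step is short enough to state as code. In this sketch the function name is hypothetical, and a strict comparison is used so that inputs 0 and 1 round deterministically; this differs from a non-strict comparison only on an event of probability zero.

```python
import random

def round_snapshot(z, rng=random):
    """Convert a snapshot of values in [0, 1] into a binary snapshot.

    Each coordinate z_i independently becomes 1 with probability z_i,
    by comparing it with a fresh uniform random threshold.
    """
    return [1 if rng.random() < z_i else 0 for z_i in z]
```

Conditioned on the constituent that generated the snapshot, each output bit is 1 with probability equal to that constituent's expectation, so the binary snapshot is distributed as a snapshot of the corresponding coin.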

Now we apply Theorem 15.31 setting ^ = s^^. Let u be the empirical NBM- vector obtained from 
the $(2k-1)$-snapshots of the above binary source (i.e., $\tilde\nu_i = \binom{2k-1}{i}^{-1} \cdot$ (frequency with which the $(2k-1)$-snapshot has exactly $i$ 1s)). The stated sample size ensures, via a Chernoff bound, that
Pr [Iz/j — i/(i?, a)i\ > ^g^] < ^ for alH = 0, . . . , 2fe — 1. Hence, with probability at least 1 — ip, we 
have - iy{'d,a)\\2 < V2k ■ ||P - i/(i?, a)||oo < ^4-^/\/3. ■ 
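The empirical NBM vector used in this proof can be computed by counting ones in each binary snapshot. The sketch assumes, consistent with Eqn. (4), that the normalization is division by the binomial coefficient; the function name is hypothetical.

```python
from math import comb

def empirical_nbm(snapshots):
    """Empirical normalized binomial moments from binary snapshots.

    Each snapshot is a list of 2k - 1 bits. Entry i of the result is the
    fraction of snapshots with exactly i ones, divided by C(2k - 1, i).
    """
    m = len(snapshots[0])  # aperture 2k - 1
    counts = [0] * (m + 1)
    for snapshot in snapshots:
        counts[sum(snapshot)] += 1
    total = len(snapshots)
    return [counts[i] / (total * comb(m, i)) for i in range(m + 1)]
```

Multiplying this vector by the Pascal matrix of Section 5.1 would then give the empirical ordinary moments that Algorithm 2 takes as input.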

5.2 Proof of Lemma 5.5

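There are two easy cases to dismiss before we reach the more subtle part of this lemma. The first easy case is $\ell = 1$. In this case $\gamma$ is a single Lagrange interpolant: $\gamma(x) = \prod_{j=2}^{k+1} \frac{x - \beta_j}{\beta_1 - \beta_j}$. For $0 \le i \le k$ let $e_i^k(\beta_2, \dots, \beta_{k+1})$ be the $i$'th elementary symmetric mean,

$$e_i^k(\beta_2, \dots, \beta_{k+1}) = \binom{k}{i}^{-1} \sum_{S \subseteq \{2,\dots,k+1\}:\,|S| = i}\ \prod_{j \in S} \beta_j$$

and observe that for all $i$, $0 \le e_i^k(\beta_2, \dots, \beta_{k+1}) \le 1$. Now

$$\gamma(x) = \Bigl(\prod_{j=2}^{k+1} \frac{1}{\beta_1 - \beta_j}\Bigr) \sum_{i=0}^{k} (-1)^{k-i} \binom{k}{i}\, e_{k-i}^k(\beta_2, \dots, \beta_{k+1})\, x^i.$$

So $\sum_{i=0}^{k} \gamma_i^2 = \Bigl(\prod_{j=2}^{k+1} \frac{1}{\beta_1 - \beta_j}\Bigr)^2 \sum_{i=0}^{k} \Bigl(\binom{k}{i} e_i^k(\beta_2, \dots, \beta_{k+1})\Bigr)^2 \le s^{-2k} \sum_{i=0}^{k} \binom{k}{i}^2 = s^{-2k}\binom{2k}{k}$.

The second easy case is $\ell = k$; this is almost as simple. Merely note that the above argument applies to the polynomial $1 - \gamma$, so that we have only to allow for the possible increase of $|\gamma_0|$ by 1. Hence $\sum_{i=0}^{k} \gamma_i^2 \le 4\binom{2k}{k} s^{-2k}$.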

We now consider the less trivial case of $1 < \ell < k$. The difficulty here is that the Lagrange interpolants of $\gamma$ may have very large coefficients, particularly if among $\beta_1, \dots, \beta_\ell$ or among $\beta_{\ell+1}, \dots, \beta_{k+1}$ there are closely spaced roots, as well there may be. We must show that these large coefficients cancel out in $\gamma$.

The trick is to examine not $\gamma$ but $d\gamma/dx$. The roots of the derivative interlace the two sets on which $\gamma$ is constant, which is to say, with $\beta'_1 < \dots < \beta'_{k-1}$ denoting the roots of $d\gamma/dx$, that for $j < \ell$, $\beta_j < \beta'_j < \beta_{j+1}$, and for $j \ge \ell$, $\beta_{j+1} < \beta'_j < \beta_{j+2}$. In particular, none of the roots fall in the interval $(\beta_\ell, \beta_{\ell+1})$. For some constant $C$ we can write $d\gamma/dx = C \prod_{j=1}^{k-1} (x - \beta'_j)$ (with $\mathrm{sign}(C) = (-1)^{k+\ell+1}$).

Observe that f^^'+' ^{x) dx = -1. So {-l)'^+^-^/C = //;+'(" 1)""^ 11^=0 (^^ " Pj) dx. Observe 
that if for any j < I, j3'- is increased, or if for any j > j3'- is decreased, then the integral decreases. 

So (-l)i+'^-7C > fp'+\-lY-\x - f3eY~^{x - Pi+iY~^ dx. This is a definite integral that can 
be evaluated in closed form: 

/ {-ir-\x - PtY-\x - Pt+iT-' dx = (/3,+i - P,Y{1 -1)\{k- 1)1/ ti\ . 
Hence, (—1)^"'"'^ < s'^(e-i)\{K-ey. • "The sum of squares of coefficients of ^ is 

C^E*t"o r7^)^(e»^"H/5i>---,/?Li))^ < C^Ck-i)- Integration only decreases the magnitude of 
the coefficients, so the same bound apphes to 7, with the exception of the constant coefficient. The 
constant coefficient can be bounded by the fact that 7 has a root in (0, 1), and that in that interval 
the derivative is bounded in magnitude by CY^'^~q {^~^) = C • 2^^. So I70I < C • 2**. Consequently, 
E»=o is at most 

which completes the proof of the lemma. ■ 



2k 



+ 2 



2k 



5.3 Proof of Lemma 5.6

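Recall that $G = G(g(\vartheta,\alpha))$ is the $k \times (k+1)$ matrix defined by $G_{ij} = g(\vartheta,\alpha)_{i+j}$ for $i = 0, \dots, k-1$ and $j = 0, \dots, k$; $\lambda$ is such that $G\lambda = 0$ and $\lambda_k = 1$; $\Lambda = \Lambda(\lambda)$ is the $2k \times k$ matrix with $\Lambda_{ij} = \lambda_{i-j}$ (for $0 \le i < 2k$, $0 \le j < k$ with the understanding $\lambda_\ell = 0$ for $\ell \notin \{0, \dots, k\}$); and $P_\lambda(x)$ is the polynomial $\sum_{\ell=0}^{k} \lambda_\ell x^{\ell}$. We use $V_k, V_{2k}$ to denote $V_k(\alpha), V_{2k}(\alpha)$ respectively, and $\tilde V_k, \tilde V_{2k}, \tilde G, \tilde\Lambda$ to denote $V_k(\tilde\alpha), V_{2k}(\tilde\alpha), G(\tilde g), \Lambda(\tilde\lambda)$ respectively. We abbreviate $g(\vartheta,\alpha)$ to $g$.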

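Lemma 5.7. If $\|\tilde g - g\|_2 \le \xi$ then $\|G\tilde\lambda\|_1 \le 2^{k+1} k \xi$.

Proof. First, observe that $\tilde G\lambda = G\lambda + (\tilde G - G)\lambda = (\tilde G - G)\lambda$. Also $\|\lambda\|_2 \le \|\lambda\|_1 = \prod_{i=1}^{k} (1 + \alpha_i) \le 2^k$. The last equality and inequality follow since $P_\lambda(x) = \prod_{i=1}^{k} (x - \alpha_i)$, and $P_\lambda(-1) = (-1)^k \|\lambda\|_1$. So for any $i = 1, \dots, k$, $|(\tilde G - G)_i \cdot \lambda| \le \|\lambda\|_2 \|\tilde G_i - G_i\|_2 \le 2^k \xi$. Thus, $\lambda$ is a feasible solution to (P), which implies that $\|\tilde\lambda\|_1 \le 2^k$. We have $\|G\tilde\lambda\|_1 \le \|\tilde G\tilde\lambda\|_1 + \|(G - \tilde G)\tilde\lambda\|_1 \le 2^k k \xi + \|(G - \tilde G)\tilde\lambda\|_1$. For any $i = 1, \dots, k$, $|(G - \tilde G)_i \cdot \tilde\lambda| \le \|G_i - \tilde G_i\|_2 \|\tilde\lambda\|_2 \le 2^k \xi$, so $\|G\tilde\lambda\|_1 \le 2^{k+1} k \xi$. ■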

Lemma 5.8. For every ai, i = 1, . . . , k, there exists a a{i) £ {1, . . . , k} such that t?i|ai — < 

Proof. Since ||GA||2 < 2^^^k^ (by Proposition 15. 7p . we have equivalently that the ||.||2 norm of 
gA = '&V2kA is at most 2'^+^A;^. We may write ??V2fcA as 



/ P-^{ai) aiP^(ai) ••• a'^-^P-^iai) \ 



W2kA = ^ 



Pl{a2) a2P\{a2) ••• 02 ^Pi{a2 



\Pl{ak) akP-^{ak) ■■■ ^P-^{ak) J 



which is equal to i9'Vfc(a) where ■(?' = [-diP^lai), • • • , '!9fcPj^(ajt)) . Thus, we are given that ||'!9'Vfc||2 < 
2'=+ifc^. 

Let (7*)t = (argmiUj^giRfc.j^^i ||yV5c||2) Vfc. Then, we also have ||i9'Vfc||2 > maxj |'!9^|||7*||2. Note 
that 7* must be orthogonal to (Vfc)j=K for all j 7^ i, and (^4)1* 7* = ||7*|||. (Recall that Zi^: denotes 
row i of a matrix Z.) Let Qi{x) = Yle=o 'le^^ ■ Then, Qi{x) = ||7'||2 Wjj^i a -a • ^^^o, since the 
coefficients of Qi{x) have alternating signs, we have 

iQ.(-i)i = iifiii = iiyiiin^^- 

-h^ \ai - ajl 
Hence, ||7*||2 > Yij^i ^'^i+a^ ^ ■ obtain the lower bound 

The last inequality follows since complex roots occur in conjugate pairs, so if = a + bi is complex, 
then there must be some I' such that a£' = a — bi and therefore, 

\ai - Ojl = {{oi - of + b^) ■ JJ \ai - aj\ > (oj - of ■ |aj - aj\. 

Now, we claim that la, — Re(Qj)| > — dj| — e| foreveryj. If both Re(Q;j) and Re(Qj) lie in [0,1], 
or both of them are less than 0, or both are greater than 1, then this follows since \aj — ctj\ < e 
and Ui £ [0, 1]. If Re(aj) ^ [0, 1] but Re(aj) e [0, 1], or if Re(aj) G [0, 1] but Re{aj) ^ [0, 1], then 
this again follows since \aj — 0(j\ < £• Combining everything, we get that 

This implies that for every i = 1, . . . ,k, there exists a{i) G {1, . . . ,k} such that "dilai — Q!o-(j)l ^ 
^■{2kcf' + e. U 

We can now wrap up the proof of Lemma \5M Let rj = ^ ■ ^^'^ . We will bound ||??V2A; — g\\2 
by exhibiting a solution y G [0,1]*^, ||y||i = 1 such that \\yV2k — 5II2 < \\g — g\\ + A;(8A;)^/^r/. Let 
a be the function whose existence is proved in Lemma [5T8l For j = 1, . . . , k, set yj = X]j.o-{j)=j ''^i 
(if = 0, then yj = 0). We have \\yV2k - gh < \\g - sIb + lb - yVzkh- We expand 

5 - yV2k = ^V2k - yV2k = Ei=i ^i{{V2k)i* - {V2k)a{i)*) For every i, 

2k-l 

mv2k)^. - {V2kU^).\\^ = ^1 E - "'^w)' ^ • • 

Therefore, \\g - yV2k\\2 < kiSkf^'^r]. ■ 

A Sample-size dependence of [3], [4], [5] on n for $\ell_1$-reconstruction



We view $P = (p^1, \dots, p^k)$ as an $n \times k$ matrix. Recall that $r = \sum_{t=1}^{k} w_t p^t$, $A = \sum_{t=1}^{k} w_t (p^t - r)(p^t - r)^\dagger$, and $M = rr^\dagger + A$. Let $w_{\max} := \max_t w_t$. We consider isotropic $k$-mixture sources,
which is justified by Lemma 13.31 So ^ < < ;| for all i G [n\. Note that ||r||i and ||r||| are 
both 0(^)- It will be convenient to split the width parameter C, into two parameters. Let (i) 

= miup^qgp^p^q \\p — q\\2] and (ii) be the smallest non-zero eigenvalue of A. Then, the width 

of {w,P) is C = maxj^i, ^2}- We use (Ti{Z) to denote it i-th largest singular value of a matrix Z. 
If $Z$ has rank $\ell$, its condition number is given by $\kappa(Z) := \sigma_1(Z)/\sigma_\ell(Z)$. For a square matrix $Z$ with real eigenvalues, we use $\lambda_i(Z)$ to denote the $i$-th largest eigenvalue of $Z$. Note that if $Z$ is an $n \times k$ matrix, then $\sigma_i(Z)^2 = \lambda_i(ZZ^\dagger) = \lambda_i(Z^\dagger Z)$ for all $i = 1, \dots, k$. Also the singular values of $ZZ^\dagger$ coincide with its eigenvalues, and the same holds for $Z^\dagger Z$.

We now proceed to evaluate the sample-size dependence of [3], [4], [5] on $n$ for reconstructing
the mixture constituents within $\ell_1$-distance $\epsilon$. Since these papers use different parameters than we
do, in order to obtain a meaningful comparison, we relate their bounds to our parameters $\zeta_1, \zeta_2$;
we keep track of the resulting dependence on $n$ but ignore the (polynomial) dependence on other
quantities. We show that the sample size needed is at least r2(^j, with the exception of Algorithm 
B in [3], which needs ^(^) samples. As required by [3llU[5], we assume that P has full column 
rank. It follows that $M$ has rank $k$ and $A$ has rank $k - 1$. The following inequality will be useful.

Proposition A.1. Let $D = \mathrm{diag}(d_1, \ldots, d_k)$ where $d_1 \ge d_2 \ge \cdots \ge d_k > 0$. Then $\lambda_k(PDP^\dagger) \ge d_k \lambda_k(PP^\dagger) = d_k \sigma_k(P)^2$.
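
The inequality can be checked numerically on a small instance. The sketch below is only an illustration for $k = 2$: the columns of $P$ and the diagonal entries are hypothetical, and it relies on the standard fact that the nonzero eigenvalues of $PDP^\dagger$ agree with those of the $k \times k$ matrix $D^{1/2} P^\dagger P D^{1/2}$, so a closed form for symmetric $2 \times 2$ matrices suffices.

```python
import math


def min_eig_sym2(x, y, z):
    """Smallest eigenvalue of the symmetric 2x2 matrix [[x, y], [y, z]]."""
    return (x + z) / 2.0 - math.sqrt(((x - z) / 2.0) ** 2 + y * y)


def dot(u, v):
    """Euclidean inner product of two equally long vectors."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical columns of P (k = 2, n = 4) and diagonal entries d_1 >= d_2 > 0.
p1 = [0.4, 0.3, 0.2, 0.1]
p2 = [0.1, 0.2, 0.3, 0.4]
d = [0.7, 0.3]

a, b, c = dot(p1, p1), dot(p1, p2), dot(p2, p2)

# lambda_k(P P^T) equals the smallest eigenvalue of the k x k Gram matrix P^T P.
lam_k_PP = min_eig_sym2(a, b, c)

# The nonzero eigenvalues of P D P^T agree with those of D^{1/2} P^T P D^{1/2}.
s = math.sqrt(d[0] * d[1])
lam_k_PDP = min_eig_sym2(d[0] * a, s * b, d[1] * c)

print(lam_k_PDP, ">=", d[1] * lam_k_PP)
```

A single instance does not prove the proposition, of course; the sketch only makes the two sides of the inequality concrete.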

Comparison with [5]. The algorithm in [5] requires also that $P$ be $\rho$-separable. This means
that for every $t \in [k]$, there is some $i \in [n]$ such that $p^t_i \ge \rho$ and $p^{t'}_i = 0$ for all $t' \ne t$. This has the
following implications. For any $t, t' \in [k]$, $t \ne t'$, we have $\|p^t - p^{t'}\|_2 \ge \sqrt{2}\rho$, so $\ge \sqrt{2}\rho$. We can
write $P^\dagger P = Y + Z$, where $Y$ is a PSD matrix, and $Z$ is a diagonal matrix whose diagonal entries
are at least $\rho^2$. So $\lambda_k(P^\dagger P) = \lambda_k(PP^\dagger) \ge \rho^2$. Therefore,

— + Ml = Afc(^) + Ml > Xk{M) > W^in ■ p^ 

n 

where the first inequality follows from Lemma 2.2 and the second from Proposition A.1. It follows
that $\rho = O(\frac{1}{\sqrt{n}})$. The bound in [5] to obtain $\ell_\infty$ error $\epsilon$ is (ignoring dependence on other quantities)

O(p^). So setting e = ^ to guarantee £i-error at most e and plugging in the above upper bounds 

on />, we obtain that the sample size is f](^). 
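
The two consequences of $\rho$-separability used in this comparison (pairwise $\ell_2$-distance at least $\sqrt{2}\rho$, and $\lambda_k(P^\dagger P) \ge \rho^2$) can be illustrated on a toy instance. The following is a minimal sketch with hypothetical constituents for $k = 2$; the helper `separability` is our own illustrative function, not part of any algorithm in [5].

```python
import math

# Hypothetical constituents (k = 2, n = 4); item 0 anchors p^1, item 1 anchors p^2.
P_cols = [[0.5, 0.0, 0.3, 0.2],
          [0.0, 0.4, 0.3, 0.3]]


def separability(cols):
    """Largest rho for which `cols` is rho-separable (0.0 if some column has no anchor)."""
    k, n = len(cols), len(cols[0])
    per_column = []
    for t in range(k):
        # Anchor items: coordinates where every other constituent has zero mass.
        anchors = [cols[t][i] for i in range(n)
                   if all(cols[s][i] == 0 for s in range(k) if s != t)]
        if not anchors:
            return 0.0
        per_column.append(max(anchors))
    return min(per_column)


rho = separability(P_cols)
p1, p2 = P_cols
dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(p1, p2)))

# Smallest eigenvalue of the 2x2 Gram matrix P^T P = [[a, b], [b, c]].
a = sum(x * x for x in p1)
b = sum(x * y for x, y in zip(p1, p2))
c = sum(y * y for y in p2)
lam_min = (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)

print("rho =", rho)
print(dist, ">=", math.sqrt(2) * rho)
print(lam_min, ">=", rho ** 2)
```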

Comparison with [3]. The sample size required by [3] for the latent Dirichlet model for 
obtaining ^2 error e is ^{^i^^^p^) ■ Proposition I A. II vields Afc(M) > Wmin • (^k{P)'^ and as argued 
above, $\lambda_k(M) \le \lambda_k(A) + \|r\|_2^2 = O(\frac{1}{n})$. So $\sigma_k(P)^2 = O(\frac{1}{n})$. Setting $\epsilon = \epsilon/\sqrt{n}$ for $\ell_1$ error $\epsilon$, this
yields a bound of r2(^j. 
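
The rescaling of the accuracy parameter in these comparisons rests on two standard norm inequalities. Here $x \in \mathbb{R}^n$ stands for the difference between a constituent and its estimate; this notation is ours and is introduced only for illustration.

```latex
\|x\|_1 = \sum_{i=1}^{n} |x_i| \cdot 1 \le \sqrt{n}\,\|x\|_2
\quad\text{(Cauchy--Schwarz)},
\qquad
\|x\|_1 = \sum_{i=1}^{n} |x_i| \le n\,\|x\|_\infty .
```

Hence an $\ell_2$ guarantee of $\epsilon/\sqrt{n}$, or an $\ell_\infty$ guarantee of $\epsilon/n$, suffices for $\ell_1$ error at most $\epsilon$.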

Comparison with [4]. Algorithm A in [3] requires sample size ^ ( o-^ {P)^<yk {M)'^e'^ ) recover 
each to within £2-distance emax„gp llplb- Since maxpgp llplU < — '^—7= due to isotropy, we can 

^minV^ 

set $\epsilon$ = to obtain $\ell_1$-error $\epsilon$. Since $\sigma_k(P)^2$ and $\sigma_k(M) = \lambda_k(M)$ are both $O(\frac{1}{n})$, we obtain a

bound oin{^). 

Algorithm B in [4] uses sample size q(^k{P)^ / ■ (Tfc(M)^e^)^ to recover each to within £2- 
distance $\epsilon \max_{p \in P} \|p\|_2$. Clearly $\kappa(P) \ge 1$. Again, setting $\epsilon$ = , this yields a sample size of
fi(^) for ii error e. 